Chapter 12: Quiz
Test your understanding of generative models. Answers follow each question.
Question 1
What is the fundamental difference between a discriminative model and a generative model?
Answer
A **discriminative model** learns the conditional distribution $p(y \mid \mathbf{x})$ — given input data, predict the output. A **generative model** learns the data distribution itself, $p(\mathbf{x})$ (or the joint $p(\mathbf{x}, y)$). This distinction matters because generative models can sample new data, evaluate densities (for anomaly detection), and learn latent representations, while discriminative models can only make predictions for given inputs.
Question 2
Write the ELBO decomposition for a VAE and explain what each term does.
Answer
$$\text{ELBO} = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} [\log p_\theta(\mathbf{x} \mid \mathbf{z})]}_{\text{reconstruction term}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))}_{\text{KL regularizer}}$$
The **reconstruction term** measures how well the decoder reconstructs the input from the latent code — it encourages the encoder to produce informative latent representations. The **KL regularizer** penalizes the encoder for deviating from the prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ — it encourages a smooth, structured latent space that can be sampled from. Maximizing the ELBO maximizes a lower bound on the log-likelihood $\log p_\theta(\mathbf{x})$.
Question 3
Why can't we backpropagate through a sampling operation, and how does the reparameterization trick solve this?
Answer
Backpropagation requires computing the gradient of the loss with respect to all parameters. If $\mathbf{z}$ is sampled from $q_\phi(\mathbf{z} \mid \mathbf{x})$, the sampling operation is a stochastic node: there is no deterministic function from parameters to output, so the chain rule cannot be applied directly. The **reparameterization trick** expresses $\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is independent of $\phi$. Now $\mathbf{z}$ is a deterministic, differentiable function of $\phi$ (through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$), and gradients flow through the multiplication and addition operations.
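A minimal plain-Python sketch of the trick (illustrative only; in a real VAE, `mu` and `logvar` are encoder outputs and the arithmetic runs in an autodiff framework):

```python
import math
import random

def reparameterize(mu, logvar, eps=None):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    All randomness is isolated in eps, so z is a deterministic,
    differentiable function of (mu, logvar) and gradients can flow.
    """
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    sigma = [math.exp(0.5 * lv) for lv in logvar]  # logvar = log sigma^2
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]
```

Fixing `eps` makes the output a pure function of the parameters, which is exactly the property backpropagation needs.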
Question 4
What is the gap between the log-likelihood and the ELBO, and what does it represent?
Answer
$$\log p_\theta(\mathbf{x}) - \text{ELBO} = D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x}))$$
The gap is the KL divergence between the variational approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ and the true posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$. Since KL divergence is non-negative, the ELBO is always a lower bound. The bound is tight when the variational family is expressive enough to match the true posterior exactly. In practice, restricting $q_\phi$ to diagonal Gaussians introduces a permanent gap if the true posterior is non-Gaussian.
Question 5
In the GAN minimax game, what is the optimal discriminator for a fixed generator $G$?
Answer
$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})}$$
The optimal discriminator outputs the probability that a sample comes from the real data distribution rather than the generator. At equilibrium (when $p_G = p_{\text{data}}$), $D^*(\mathbf{x}) = 1/2$ for all $\mathbf{x}$ — the discriminator cannot distinguish real from fake.
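A one-line numeric check of the formula (with hypothetical density values; real GANs never evaluate $p_G$ explicitly):

```python
def optimal_discriminator(p_data_x, p_g_x):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for a fixed generator."""
    return p_data_x / (p_data_x + p_g_x)
```

At equilibrium the two densities agree everywhere, so the output is 1/2 regardless of the input.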
Question 6
Explain mode collapse in GANs. Why does it occur, and what does it look like?
Answer
**Mode collapse** occurs when the generator learns to produce only a small subset of the possible outputs, ignoring the diversity of the real data distribution. It happens because the generator's objective is to fool the discriminator, not to cover all modes: producing one very convincing output type may yield lower loss than producing diverse but less convincing outputs. Visually, mode collapse looks like the generator producing nearly identical outputs regardless of the input noise $\mathbf{z}$. In metrics, it manifests as high precision (generated samples look realistic) but low recall (many real data modes are not represented).
Question 7
What is spectral normalization, and why is it used in GAN training?
Answer
**Spectral normalization** (Miyato et al., 2018) divides each weight matrix $\mathbf{W}$ by its spectral norm (largest singular value $\sigma_1(\mathbf{W})$) after every parameter update: $\mathbf{W} \leftarrow \mathbf{W} / \sigma_1(\mathbf{W})$. This enforces a Lipschitz constraint on the discriminator (each layer has Lipschitz constant $\leq 1$), which stabilizes training by preventing the discriminator from producing arbitrarily large gradients. It is computationally efficient (one power iteration step per layer per update) and has become the default GAN stabilization technique, often replacing the gradient penalty used in WGAN-GP.
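A power-iteration sketch for estimating $\sigma_1(\mathbf{W})$, using plain-Python list-of-lists matrices (real implementations cache the iteration vectors across updates and run a single step per update rather than iterating to convergence):

```python
import math

def spectral_norm_estimate(W, n_iters=100):
    """Estimate the largest singular value of W by power iteration."""
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(n_iters):
        # u <- W v, normalized
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        nu = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / nu for x in u]
        # v <- W^T u, normalized
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        nv = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / nv for x in v]
    # sigma_1 is approximately u^T W v at convergence
    return sum(u[i] * sum(W[i][j] * v[j] for j in range(cols))
               for i in range(rows))
```

Dividing every weight matrix by this estimate caps each layer's Lipschitz constant at roughly 1.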
Question 8
Write the closed-form expression for the forward process of a DDPM. What does $\bar{\alpha}_t$ represent?
Answer
$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \, \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})$$
Equivalently: $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \, \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Here $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative product of $(1 - \beta_s)$. It represents the fraction of the original signal preserved at timestep $t$: $\sqrt{\bar{\alpha}_t}$ scales the clean data, and $\sqrt{1 - \bar{\alpha}_t}$ scales the added noise. As $t \to T$, $\bar{\alpha}_t \to 0$ and the signal is completely destroyed.
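The closed form translates directly into code. A plain-Python sketch (data as a list of floats; `betas` stands in for the noise schedule $\beta_1, \dots, \beta_T$):

```python
import math
import random

def q_sample(x0, t, betas, eps=None):
    """Draw x_t from q(x_t | x_0) in one shot:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = 1.0
    for s in range(t):               # abar_t = prod_{s=1..t} (1 - beta_s)
        abar *= 1.0 - betas[s]
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in x0]
    return [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e
            for x, e in zip(x0, eps)]
```

Because any $\mathbf{x}_t$ is reachable in one step, training can sample a random timestep per example instead of simulating the whole chain.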
Question 9
What is the simplified DDPM training objective, and why is it effective?
Answer
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \right]$$
The model is trained to predict the noise $\boldsymbol{\epsilon}$ that was added to the clean data $\mathbf{x}_0$ to produce the noisy version $\mathbf{x}_t$. This is a simple regression objective — no adversarial dynamics, no posterior collapse, no mode collapse. It is effective because it is equivalent (up to a timestep-dependent weighting) to the full variational lower bound on the log-likelihood, and Ho et al. (2020) showed empirically that the unweighted version works as well or better than the weighted version.
Question 10
How does the noise prediction $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ relate to the score function $\nabla_{\mathbf{x}} \log p(\mathbf{x})$?
Answer
$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}}$$
This relation holds for the conditional distribution given $\mathbf{x}_0$; in expectation over $\mathbf{x}_0$, the optimal noise predictor matches the score of the marginal, so $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \approx -\sqrt{1 - \bar{\alpha}_t} \, \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$. This means the diffusion model learns the gradient of the log-density at every noise level. The reverse sampling process is then a form of annealed Langevin dynamics — following the score function from high-noise to low-noise to arrive at the data distribution.
Question 11
What is classifier-free guidance, and how does the guidance scale $w$ affect generated samples?
Answer
**Classifier-free guidance** interpolates between conditional and unconditional noise predictions:
$$\hat{\boldsymbol{\epsilon}} = (1 + w) \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - w \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$$
During training, the condition $y$ is randomly dropped with some probability (e.g., 10%), so the network learns both conditional and unconditional denoising. The guidance scale $w$ controls the quality-diversity tradeoff: $w = 0$ gives standard conditional generation; increasing $w$ sharpens the conditional distribution, producing samples that more closely match the condition at the cost of reduced diversity. Very high $w$ values produce oversaturated, artifact-heavy outputs.
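The combination step itself is a two-line function (a sketch; `eps_cond` and `eps_uncond` stand in for the two network outputs):

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond.

    w = 0 recovers the plain conditional prediction; larger w pushes
    the sample further toward the condition.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]
```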
Question 12
Explain the difference between normalizing flows and flow matching. What practical advantage does flow matching offer?
Answer
**Normalizing flows** define an explicit invertible transformation $f_\theta$ between noise and data, requiring specialized architectures (coupling layers, autoregressive transforms) whose Jacobian determinant is efficiently computable. They provide exact log-likelihoods but are constrained by the invertibility requirement. **Flow matching** instead learns a velocity field $\mathbf{v}_\theta(\mathbf{x}_t, t)$ that transports noise to data along simple (often linear) paths. No invertibility constraint is needed — any neural network architecture can serve as the velocity predictor. The practical advantages are: (1) simpler, more flexible architectures; (2) no noise schedule to tune (linear interpolation is parameter-free); (3) straighter transport paths that require fewer ODE integration steps at inference (10-50 vs. 1000 for DDPM); and (4) exact likelihood computation via the continuous change-of-variables formula.
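The linear-path construction can be sketched in a few lines (a minimal illustration of how a training pair is built; the regression of $\mathbf{v}_\theta$ onto these targets is omitted):

```python
def fm_training_pair(x0, x1, t):
    """Linear path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1.

    The regression target for the velocity field v_theta(x_t, t) is the
    constant path derivative dx_t/dt = x1 - x0.
    """
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v
```

Because the target velocity is constant along each path, a well-trained field yields nearly straight ODE trajectories, which is why few integration steps suffice.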
Question 13
True or False: GANs provide a tractable estimate of the data log-likelihood $\log p_\theta(\mathbf{x})$.
Answer
**False.** GANs define an implicit generative model: the generator maps noise to data through a deterministic function, but the density of the generated distribution $p_G(\mathbf{x})$ is never computed. GANs can generate samples from $p_G$ (by passing noise through the generator) but cannot evaluate $p_G(\mathbf{x})$ for a given $\mathbf{x}$. This is a fundamental limitation for applications like anomaly detection or model comparison, where density evaluation is required. VAEs provide an approximate bound (ELBO), and normalizing flows/flow matching provide exact log-likelihoods.
Question 14
Why do VAEs tend to produce blurry reconstructions? What is the primary cause?
Answer
The primary cause is the pixel-wise reconstruction loss (MSE or binary cross-entropy), not the Gaussian latent space as commonly believed. When the output distribution is multimodal (e.g., an edge could be at pixel position 50 or 51), the pixel-wise loss is minimized by predicting the mean of the distribution, which produces a blurred average. The KL regularizer contributes by limiting the information capacity of the latent bottleneck, but replacing the pixel-wise loss with a perceptual loss (distance in feature space of a pretrained network) or an adversarial loss (VAE-GAN) produces much sharper outputs even with the same Gaussian latent space.
Question 15
A trained DDPM with $T = 1000$ timesteps requires 1000 neural network forward passes to generate one sample. Name two techniques that reduce this cost.
Answer
1. **DDIM sampling** (Song et al., 2021): A deterministic sampling procedure that allows skipping timesteps. Because the deterministic ODE has the same marginal distributions as the stochastic process, one can subsample a subset of timesteps (e.g., 50 out of 1000) and still produce high-quality samples.
2. **Distillation / progressive distillation** (Salimans & Ho, 2022): Train a student model to perform 2 denoising steps in 1, then use the student as a teacher to train an even faster student. This progressively halves the number of steps, achieving good quality with as few as 4-8 steps.
Other valid answers include: flow matching (which requires fewer steps by construction), latent diffusion (running the diffusion process in a lower-dimensional latent space), and consistency models (Song et al., 2023).
Question 16
In the VAE KL divergence formula $D_{\text{KL}} = -\frac{1}{2} \sum_{j=1}^{d}(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2)$, what happens when $\sigma_j^2 \to 0$ for some dimension $j$?
Answer
When $\sigma_j^2 \to 0$, the KL contribution from dimension $j$ becomes $-\frac{1}{2}(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2) \to -\frac{1}{2}(1 - \infty - \mu_j^2 - 0) = +\infty$. The $\log \sigma_j^2 \to -\infty$ term dominates, making the KL divergence infinite. This means the encoder cannot produce a delta function ($\sigma = 0$) as its posterior — the KL regularizer forces a minimum amount of noise in the latent code. This prevents the VAE from degenerating into a deterministic autoencoder, which is the mathematical mechanism by which the prior enforces a smooth, sampleable latent space.
Question 17
What is the Wasserstein distance, and why does it provide better gradients than the Jensen-Shannon divergence for GAN training?
Answer
The **Wasserstein-1 distance** (earth mover's distance) measures the minimum "work" (mass $\times$ distance) needed to transform one distribution into another:
$$W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \gamma}[\|\mathbf{x} - \mathbf{y}\|]$$
The JS divergence saturates (becomes $\log 2$, with zero gradient) when the two distributions have disjoint supports — which is almost always the case in high-dimensional spaces early in training. The Wasserstein distance, by contrast, is continuous and differentiable even when distributions do not overlap, because it measures how far apart the distributions are, not merely whether they overlap. This provides a meaningful gradient signal to the generator at all stages of training, enabling smoother convergence.
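In one dimension, with equal-size samples, $W_1$ between two empirical distributions has a closed form (match the sorted samples pairwise), which makes the "finite distance even without overlap" point easy to verify:

```python
def w1_empirical(xs, ys):
    """W_1 between equal-size 1-D empirical distributions:
    the optimal transport plan pairs the i-th smallest x
    with the i-th smallest y."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

Two samples with disjoint supports still get a finite distance that shrinks smoothly as they approach each other, unlike the saturating JS divergence.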
Question 18
For a StreamRec item VAE with latent dimension $d = 16$, what is the expected value of the KL divergence $D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$ at the beginning of training when the encoder outputs $\boldsymbol{\mu} \approx \mathbf{0}$ and $\log \boldsymbol{\sigma}^2 \approx \mathbf{0}$?
Answer
When $\boldsymbol{\mu} = \mathbf{0}$ and $\log \boldsymbol{\sigma}^2 = \mathbf{0}$ (i.e., $\boldsymbol{\sigma}^2 = \mathbf{1}$), each dimension's contribution to the KL is:
$$-\frac{1}{2}(1 + \log 1 - 0 - 1) = -\frac{1}{2}(1 + 0 - 0 - 1) = 0$$
The total KL divergence is $\sum_{j=1}^{16} 0 = 0$. This is expected: when the encoder outputs parameters matching the prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$, the KL divergence is zero. In practice, neural network initialization produces near-zero outputs, so the KL starts near zero and increases as the encoder learns to encode meaningful information into the latent space.
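Both this worked value and the $\sigma_j^2 \to 0$ blow-up from Question 16 can be checked with the closed-form KL in a few lines of plain Python:

```python
import math

def gaussian_kl(mu, logvar):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)) =
    -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))
```

At `mu = 0`, `logvar = 0` the result is exactly zero, and driving `logvar` toward minus infinity makes the KL grow without bound.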
Question 19
Name three practical applications of generative models beyond image generation.
Answer
1. **Synthetic data generation**: Generating realistic tabular data (e.g., electronic health records, financial transactions) that preserves statistical properties without containing real individual records — enabling data sharing, model development, and testing under privacy constraints.
2. **Data augmentation**: Generating synthetic minority-class examples to address class imbalance in classification problems, or creating additional training examples when real data is scarce or expensive to collect.
3. **Anomaly detection**: Training a generative model on normal data and using reconstruction error (VAE), ELBO, or denoising loss (diffusion) to flag inputs that are out-of-distribution — manufacturing defect detection, fraud detection, or network intrusion detection.
Other valid answers: missing data imputation, drug/molecule design, stochastic weather simulation, music/audio generation, text generation.
Question 20
A colleague claims: "Diffusion models are strictly better than VAEs and GANs, so we should always use them." Provide three scenarios where this claim is wrong.