Chapter 12: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
VAE Fundamentals
Exercise 12.1 (*)
Consider a VAE with a 2-dimensional latent space and a Gaussian encoder $q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$.
(a) For a single data point with encoder outputs $\boldsymbol{\mu} = [0.5, -1.0]$ and $\log \boldsymbol{\sigma}^2 = [-0.2, 0.8]$, compute the KL divergence $D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$ using the closed-form expression from Section 12.3. Show your work.
(b) What happens to the KL term when $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\sigma}^2 = \mathbf{1}$ (i.e., the encoder exactly matches the prior)? Verify algebraically.
(c) Which contributes more to the KL divergence in part (a): the mean term ($\mu_j^2$) or the variance term ($\sigma_j^2 - \log \sigma_j^2 - 1$)? What does this tell you about the relative importance of centering vs. scaling the posterior?
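Hint: once you have worked parts (a) and (b) by hand, the closed form is easy to check numerically. A sketch in NumPy (the function name is ours, not from Section 12.3):

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu: np.ndarray, logvar: np.ndarray) -> float:
    """Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)):
    0.5 * sum_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1)."""
    return 0.5 * float(np.sum(mu**2 + np.exp(logvar) - logvar - 1.0))

# Part (a): encoder outputs for a single data point
kl = gaussian_kl_to_standard_normal(np.array([0.5, -1.0]), np.array([-0.2, 0.8]))

# Part (b): posterior equal to the prior should give exactly zero
kl_zero = gaussian_kl_to_standard_normal(np.zeros(2), np.zeros(2))
```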
Exercise 12.2 (*)
The reparameterization trick writes $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
(a) Draw the computational graph for the VAE forward pass, labeling the stochastic and deterministic nodes. Show where gradients flow under the reparameterization trick.
(b) Explain why the naive approach — sampling $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2 \mathbf{I})$ directly — prevents backpropagation through the encoder. What is the mathematical reason that $\nabla_\phi \mathbb{E}_{q_\phi}[f(\mathbf{z})]$ is not computable by backpropagation without the reparameterization?
(c) The reparameterization trick works for Gaussian distributions. Name two distributions for which a similar reparameterization exists and one distribution for which it does not. What alternative gradient estimation technique is used in that case?
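Hint for part (a): a minimal NumPy sketch of the reparameterized sampler, showing that all randomness is isolated in $\boldsymbol{\epsilon}$ while $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ enter through a deterministic path (the sample count and seed here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
sigma = np.exp(0.5 * np.array([-0.2, 0.8]))  # sigma = exp(logvar / 2)

# All randomness lives in eps; mu and sigma appear only in a
# deterministic transformation, so gradients can flow through them.
eps = rng.standard_normal((100000, 2))
z = mu + sigma * eps

# Empirically, z ~ N(mu, diag(sigma^2))
print(z.mean(axis=0), z.std(axis=0))
```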
Exercise 12.3 (*)
Train the VAE from Section 12.3 on a synthetic dataset of 2D points drawn from a mixture of 5 Gaussians:
```python
import numpy as np

def make_gaussian_mixture(n_samples: int = 10000, seed: int = 42) -> np.ndarray:
    rng = np.random.RandomState(seed)
    centers = np.array([[2, 2], [-2, 2], [0, -2], [3, -1], [-3, -1]])
    stds = np.array([0.3, 0.4, 0.35, 0.25, 0.45])
    labels = rng.randint(0, 5, size=n_samples)
    data = centers[labels] + stds[labels, None] * rng.randn(n_samples, 2)
    return data.astype(np.float32)
```
(a) Set input_dim=2, hidden_dim=128, latent_dim=2. Train for 100 epochs with $\beta = 1.0$. Plot the original data and 1000 generated samples side by side. Does the VAE capture all 5 modes?
(b) Repeat with $\beta = 0.1$ and $\beta = 5.0$. How does $\beta$ affect mode coverage and sample quality?
(c) Encode all training points and color them by their true cluster assignment. Does the latent space show cluster structure? Overlay the prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$ as a contour plot.
Exercise 12.4 (**)
Posterior collapse occurs when the KL term drives $q_\phi(\mathbf{z} \mid \mathbf{x})$ to match the prior for all $\mathbf{x}$, causing the decoder to ignore $\mathbf{z}$.
(a) Modify the VAE training code to track the KL divergence per latent dimension across training epochs. Which dimensions collapse first? Why?
(b) Implement KL annealing: linearly increase $\beta$ from 0 to 1 over the first 30 epochs. Compare the final KL divergence per dimension and reconstruction loss to the standard VAE. Does annealing prevent collapse?
(c) Implement free bits (Kingma et al., 2016): replace the per-dimension KL with $\max(\lambda, D_{\text{KL}}^{(j)})$ where $\lambda = 0.25$ nats. How does this compare to KL annealing?
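Hint for part (c): the free-bits clamp itself is one line. A NumPy sketch, assuming the per-dimension KL values have already been computed (in the actual training loop you would apply the same clamp to a torch tensor):

```python
import numpy as np

def free_bits_kl(kl_per_dim: np.ndarray, lam: float = 0.25) -> float:
    """Free-bits objective: each latent dimension contributes at least
    lam nats to the loss, so the optimizer gains nothing by pushing a
    dimension's KL below lam -- which is what drives collapse."""
    return float(np.sum(np.maximum(kl_per_dim, lam)))

# A collapsed dimension (KL ~ 0) still counts as lam nats in the loss
print(free_bits_kl(np.array([0.01, 1.3])))  # 0.25 + 1.3 = 1.55
```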
Exercise 12.5 (**)
Derive the ELBO starting from a different route than Jensen's inequality.
(a) Start with $\log p_\theta(\mathbf{x}) = \log p_\theta(\mathbf{x}) \int q_\phi(\mathbf{z} \mid \mathbf{x}) \, d\mathbf{z}$ (since the integral equals 1). Show that:
$$\log p_\theta(\mathbf{x}) = \text{ELBO} + D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x}))$$
(b) Why does this alternative derivation make the tightness of the ELBO more transparent than the Jensen's inequality derivation?
(c) What does this imply about the behavior of the ELBO when the variational family $q_\phi$ is too restrictive (e.g., diagonal Gaussian when the true posterior is multimodal)?
GAN Fundamentals
Exercise 12.6 (*)
For a fixed generator $G$, show that the optimal discriminator is:
$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})}$$
(a) Start from the GAN objective $V(G, D)$ and differentiate with respect to $D(\mathbf{x})$ at each point $\mathbf{x}$. Show that the integrand is maximized by $D^*(\mathbf{x})$.
(b) Substitute $D^*$ back into $V(G, D^*)$ and show that the result involves the Jensen-Shannon divergence between $p_{\text{data}}$ and $p_G$.
(c) What is the value of $V(G, D^*)$ when $p_G = p_{\text{data}}$? What does the discriminator output at equilibrium?
Exercise 12.7 (*)
Mode collapse is a common GAN failure mode.
(a) Train the GAN from Section 12.4 on the 5-Gaussian mixture from Exercise 12.3. Generate 1000 samples and plot them. How many of the 5 modes does the generator capture?
(b) Monitor $D(\text{real})$ and $D(\text{fake})$ during training. Describe what you observe during mode collapse: what do these values look like when the generator cycles between modes?
(c) Increase n_critic to 5 (train the discriminator 5 times per generator update). Does this help with mode coverage? Why or why not?
Exercise 12.8 (**)
Implement a Wasserstein GAN with gradient penalty (WGAN-GP) for the 5-Gaussian mixture.
(a) Remove the sigmoid from the discriminator (it becomes a critic). Replace the binary cross-entropy losses with the Wasserstein loss:
- Critic loss: $\mathbb{E}[D(\text{fake})] - \mathbb{E}[D(\text{real})]$ (the critic minimizes this).
- Generator loss: $-\mathbb{E}[D(\text{fake})]$.
(b) Implement the gradient penalty: for each batch, sample $\hat{\mathbf{x}} = \epsilon \mathbf{x}_{\text{real}} + (1 - \epsilon) \mathbf{x}_{\text{fake}}$ with $\epsilon \sim U(0, 1)$, compute $\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})$, and add $\lambda (\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_2 - 1)^2$ to the critic loss with $\lambda = 10$.
(c) Compare mode coverage between the standard GAN and WGAN-GP on the 5-Gaussian mixture. Generate 2000 samples from each and count how many modes are covered (a mode is "covered" if at least 50 generated samples fall within 2 standard deviations of its center).
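Hint: the coverage criterion in part (c) can be packaged as a reusable helper, which later exercises can share. A sketch, assuming the mixture centers and standard deviations from Exercise 12.3 are passed in:

```python
import numpy as np

def count_covered_modes(samples: np.ndarray, centers: np.ndarray,
                        stds: np.ndarray, min_hits: int = 50) -> int:
    """A mode is 'covered' if at least min_hits samples fall within
    2 standard deviations of its center."""
    covered = 0
    for center, std in zip(centers, stds):
        dists = np.linalg.norm(samples - center, axis=1)
        if np.sum(dists <= 2 * std) >= min_hits:
            covered += 1
    return covered
```

Real data drawn from the mixture should cover all 5 modes; a collapsed generator will score 1 or 2.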
Exercise 12.9 (**)
(a) Explain why the original GAN loss $\log(1 - D(G(\mathbf{z})))$ saturates (provides vanishing gradients) when the discriminator is strong.
(b) Show that the non-saturating loss $-\log D(G(\mathbf{z}))$ provides stronger gradients when $D(G(\mathbf{z})) \approx 0$. Compute the derivative of each loss with respect to $D(G(\mathbf{z}))$ when $D(G(\mathbf{z})) = 0.01$.
(c) Despite stronger gradients, the non-saturating loss does not change the equilibrium. Prove that both losses have the same fixed point (the same generator is optimal for both).
Diffusion Models
Exercise 12.10 (*)
For a linear noise schedule with $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$ with $T = 1000$:
(a) Compute $\bar{\alpha}_t$ for $t \in \{1, 100, 250, 500, 750, 1000\}$. What fraction of the original signal remains at each timestep?
(b) For a 1D data point $x_0 = 3.0$, compute the mean and standard deviation of $q(\mathbf{x}_t \mid x_0)$ at each timestep in part (a). At which timestep is the signal-to-noise ratio approximately 1?
(c) Plot $\sqrt{\bar{\alpha}_t}$ (signal coefficient) and $\sqrt{1 - \bar{\alpha}_t}$ (noise coefficient) as functions of $t$. Describe the shape of these curves and explain why the noise schedule is chosen to be linear in $\beta_t$ rather than linear in $\bar{\alpha}_t$.
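Hint: since $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$, parts (a)–(c) reduce to a cumulative product over the schedule. A sketch:

```python
import numpy as np

# Linear schedule from part (a): beta_1 = 1e-4, beta_T = 0.02, T = 1000
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

# alpha_bar[t-1] is the fraction of signal variance remaining at step t;
# sqrt(alpha_bar) is the signal coefficient itself.
for t in [1, 100, 250, 500, 750, 1000]:
    print(t, alpha_bar[t - 1])
```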
Exercise 12.11 (*)
Implement and train the SimpleDDPM from Section 12.5 on the 5-Gaussian mixture from Exercise 12.3.
(a) Set input_dim=2, hidden_dim=256, n_timesteps=500. Train for 200 epochs. Generate 1000 samples and plot them alongside the real data.
(b) Visualize the denoising process: sample $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, then plot $\mathbf{x}_t$ at $t \in \{500, 400, 300, 200, 100, 50, 10, 0\}$ for 100 samples. You should see the samples gradually coalesce into the 5-mode structure.
(c) Compare mode coverage between the DDPM and the GAN from Exercise 12.7. Which captures more modes? Is this consistent with the theoretical expectations from Section 12.7?
Exercise 12.12 (**)
Noise schedule design significantly impacts diffusion model performance.
(a) Implement a cosine noise schedule (Nichol & Dhariwal, 2021):
$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2$$
with $s = 0.008$.
(b) Plot $\bar{\alpha}_t$ for both linear and cosine schedules. Where do they differ most? Why does the cosine schedule spend more "noise budget" in the middle timesteps?
(c) Train two DDPMs — one with each schedule — on the same dataset. Compare the loss curves and sample quality. Which produces better results, and why?
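Hint for part (a): the schedule is a direct transcription of the formula (the function name is ours):

```python
import numpy as np

def cosine_alpha_bar(T: int = 1000, s: float = 0.008) -> np.ndarray:
    """Cosine schedule (Nichol & Dhariwal, 2021): alpha_bar_t = f(t)/f(0)
    with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * (np.pi / 2)) ** 2
    return f / f[0]

abar = cosine_alpha_bar()
# alpha_bar starts at exactly 1 and decays smoothly to (numerically) 0
print(abar[0], abar[-1])
```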
Exercise 12.13 (***)
Derive the simplified DDPM loss from the variational lower bound.
(a) Start from the ELBO for the full diffusion model:
$$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_q \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right]$$
Show that this decomposes into:
$$\mathcal{L} = L_0 + \sum_{t=2}^{T} L_{t-1} + L_T$$
where $L_{t-1} = D_{\text{KL}}(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t))$.
(b) Show that $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is Gaussian with mean:
$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t$$
(c) Substituting $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon})$ and matching the parameterization $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ from Section 12.5, show that minimizing each $L_{t-1}$ is equivalent (up to a constant) to:
$$\mathbb{E}_{\boldsymbol{\epsilon}} \left[ \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \right]$$
(d) Explain why dropping the timestep-dependent weighting (the "simplified" loss) works in practice despite not being the exact variational bound.
Exercise 12.14 (**)
Implement conditional DDPM generation for the 5-Gaussian mixture.
(a) Modify the DenoisingMLP to accept a class label $y \in \{0, 1, 2, 3, 4\}$ as additional input (embed it using nn.Embedding and concatenate with the timestep embedding).
(b) Implement classifier-free guidance: during training, drop the class label (replace with a "null" label, e.g., $y = 5$) with probability 0.1. At inference, compute the guided noise prediction: $\hat{\boldsymbol{\epsilon}} = (1 + w) \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - w \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$.
(c) Generate 200 samples for each class with guidance scales $w \in \{0, 1, 3, 7\}$. Plot the results and describe how the guidance scale affects the concentration of samples around the true mode center.
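Hint for part (b): the guidance combination at inference time is a single line. A sketch with toy arrays standing in for real network outputs:

```python
import numpy as np

def guided_eps(eps_cond: np.ndarray, eps_uncond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and, for w > 0, past) the conditional one."""
    return (1 + w) * eps_cond - w * eps_uncond

eps_c = np.array([1.0, 0.0])   # stand-in for eps_theta(x_t, t, y)
eps_u = np.array([0.5, 0.5])   # stand-in for eps_theta(x_t, t, null)

# w = 0 recovers the plain conditional prediction
print(guided_eps(eps_c, eps_u, 0.0))
# w = 3 gives 4*eps_c - 3*eps_u = [2.5, -1.5]
print(guided_eps(eps_c, eps_u, 3.0))
```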
Flow Matching
Exercise 12.15 (**)
(a) Implement the FlowMatchingTrainer from Section 12.6 and train it on the 5-Gaussian mixture. Generate 1000 samples with n_steps=50 and compare to the DDPM results.
(b) Vary n_steps in $\{5, 10, 25, 50, 100, 200\}$ and plot the FID-like metric (or visual quality) as a function of steps. At what point do additional steps provide diminishing returns?
(c) Replace the Euler integrator with a 4th-order Runge-Kutta (RK4) solver. Does RK4 achieve the same quality as Euler with fewer steps?
Exercise 12.16 (***)
Optimal transport flow matching replaces the random coupling (matching each data point with a random noise sample) with the optimal transport coupling.
(a) For a batch of data points $\{\mathbf{x}_1^{(i)}\}$ and noise samples $\{\boldsymbol{\epsilon}^{(i)}\}$, compute the optimal transport assignment by solving the linear assignment problem (use scipy.optimize.linear_sum_assignment with cost matrix $C_{ij} = \|\mathbf{x}_1^{(i)} - \boldsymbol{\epsilon}^{(j)}\|^2$).
(b) Train a flow matching model with OT coupling on the 5-Gaussian mixture. Compare the learned velocity field to the random coupling version. Does OT produce straighter paths?
(c) How does OT coupling affect the number of integration steps needed for good sample quality?
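Hint for part (a): a sketch of the minibatch OT coupling using scipy.optimize.linear_sum_assignment, as the exercise suggests (the helper name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_pairing(x1: np.ndarray, eps: np.ndarray) -> np.ndarray:
    """Permute the noise batch so that total squared transport cost
    between data points and their paired noise samples is minimized."""
    # Cost matrix C[i, j] = ||x1_i - eps_j||^2
    cost = np.sum((x1[:, None, :] - eps[None, :, :]) ** 2, axis=-1)
    row_ind, col_ind = linear_sum_assignment(cost)
    return eps[col_ind]  # eps reordered to pair with x1 row-for-row
```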
Comparison and Applications
Exercise 12.17 (**)
Build a systematic comparison of all four generative families on the 5-Gaussian mixture.
(a) Train a VAE ($\beta = 1.0$), GAN, DDPM ($T = 500$), and flow matching model on the same dataset. Generate 2000 samples from each.
(b) Evaluate each model on:
- Mode coverage (fraction of the 5 modes captured)
- Sample quality (mean distance from each sample to its nearest real data point)
- Training time (wall-clock seconds)
- Sampling time (wall-clock seconds for 2000 samples)
(c) Present results in a table and discuss which model you would recommend for this task and why.
Exercise 12.18 (**)
Anomaly detection with VAEs. Train a VAE on the first 4 modes of the 5-Gaussian mixture (exclude mode 5). Then:
(a) Compute the negative ELBO (reconstruction loss + KL) for each point in a test set that includes all 5 modes. Plot the negative ELBO as a function of the data point's true cluster.
(b) Set a threshold on the negative ELBO to classify points as normal vs. anomalous. What threshold maximizes the F1 score for detecting mode-5 points?
(c) How does the ELBO-based anomaly score compare to a simpler baseline: the Mahalanobis distance to the nearest training cluster?
Exercise 12.19 (**)
Synthetic data utility. Using the Item VAE from Section 12.8:
(a) Generate 20,000 synthetic items. Train a logistic regression classifier (category prediction) on the synthetic data and evaluate on real test data. Compare to training on real data. What is the utility gap?
(b) Try a mixed training set: 50% real, 50% synthetic. Does this improve performance over real data alone?
(c) Generate synthetic items conditioned on a specific category (by fixing the first $n_{\text{categories}}$ dimensions of the decoder input). How does category-specific generation quality compare to unconditional generation?
Exercise 12.20 (***)
Implement a VAE with a learned prior (VampPrior, Tomczak & Welling, 2018).
(a) Instead of $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, define the prior as a mixture:
$$p_\lambda(\mathbf{z}) = \frac{1}{K} \sum_{k=1}^{K} q_\phi(\mathbf{z} \mid \mathbf{u}_k)$$
where $\{\mathbf{u}_1, \ldots, \mathbf{u}_K\}$ are $K$ learnable pseudo-inputs in the data space.
(b) The KL divergence no longer has a closed form. Implement a Monte Carlo estimate of $D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\lambda(\mathbf{z}))$ using the log-sum-exp trick for numerical stability.
(c) Train the VampPrior VAE on the 5-Gaussian mixture with $K = 10$. Does it learn a prior that captures the multimodal structure? Compare generated samples to the standard Gaussian prior VAE.
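Hint for part (b): a NumPy sketch of the Monte Carlo KL estimate with the log-sum-exp trick (the function names and sample count are ours; in training you would write the torch equivalent so gradients reach the pseudo-inputs):

```python
import numpy as np
from scipy.special import logsumexp

def log_normal(z, mu, logvar):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return -0.5 * np.sum(logvar + np.log(2 * np.pi)
                         + (z - mu) ** 2 / np.exp(logvar), axis=-1)

def mc_kl_vamp(mu, logvar, prior_mus, prior_logvars, n_samples=256, seed=0):
    """MC estimate of KL(q(z|x) || p_lambda(z)), where p_lambda is the
    uniform mixture of the K pseudo-input posteriors q(z|u_k)."""
    rng = np.random.default_rng(seed)
    K = prior_mus.shape[0]
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal((n_samples, mu.shape[0]))
    log_q = log_normal(z, mu, logvar)
    # log p_lambda(z) = logsumexp_k log q(z|u_k) - log K  (stable)
    log_p = logsumexp(log_normal(z[:, None, :], prior_mus, prior_logvars),
                      axis=1) - np.log(K)
    return float(np.mean(log_q - log_p))
```

A useful sanity check: if every pseudo-input posterior equals $q_\phi(\mathbf{z} \mid \mathbf{x})$ itself, the mixture equals $q$ and the estimate is exactly zero.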
Exercise 12.21 (***)
Denoising score matching provides an alternative derivation of the diffusion objective.
(a) The score function is $\nabla_\mathbf{x} \log p(\mathbf{x})$. Show that for the noisy distribution $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})$:
$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0}{1 - \bar{\alpha}_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}}$$
(b) The denoising score matching objective is:
$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \|\mathbf{s}_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0)\|^2 \right]$$
Show that if we parameterize $\mathbf{s}_\theta(\mathbf{x}_t, t) = -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) / \sqrt{1 - \bar{\alpha}_t}$, then minimizing $\mathcal{L}_{\text{DSM}}$ is equivalent to minimizing $\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2$.
(c) What is the advantage of the score-matching perspective over the variational bound perspective for extending diffusion models to continuous time?
Exercise 12.22 (***)
Implement DDIM sampling (Song et al., 2021) as an alternative to the stochastic DDPM sampler.
(a) The DDIM update rule is:
$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + \sigma_t \boldsymbol{\epsilon}$$
Setting $\sigma_t = 0$ makes sampling deterministic. Implement this in the SimpleDDPM class as an alternative sample_ddim method.
(b) With $\sigma_t = 0$ and $T = 1000$ training steps, sample using only $S \in \{10, 25, 50, 100, 250\}$ evenly spaced steps. Compare sample quality to the full 1000-step DDPM sampler.
(c) Why does deterministic sampling (DDIM) allow skipping steps while stochastic sampling (DDPM) does not? What property of the deterministic ODE makes this possible?
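Hint for part (a): a single DDIM update can be written as a standalone function before wiring it into SimpleDDPM (NumPy here for clarity; the class method would use torch tensors):

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev, sigma_t=0.0, noise=None):
    """One DDIM update (Song et al., 2021). sigma_t = 0 is fully
    deterministic; sigma_t > 0 interpolates toward DDPM sampling."""
    # Predicted clean sample from the noise prediction
    x0_pred = (x_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Direction pointing back toward x_t
    dir_xt = np.sqrt(1 - abar_prev - sigma_t**2) * eps_pred
    x_prev = np.sqrt(abar_prev) * x0_pred + dir_xt
    if sigma_t > 0:
        x_prev = x_prev + sigma_t * noise
    return x_prev
```

Sanity check: if eps_pred is the exact noise that produced $\mathbf{x}_t$ from $\mathbf{x}_0$, one step to $\bar{\alpha}_{t-1} = 1$ recovers $\mathbf{x}_0$ exactly.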
Exercise 12.23 (***)
Latent diffusion. Instead of running diffusion in pixel/data space, encode data into a latent space first, then run diffusion there.
(a) Train a VAE on the 5-Gaussian mixture to learn a 2D latent space. Freeze the VAE.
(b) Encode the training data into the latent space. Train a DDPM on the latent encodings (not the original data).
(c) To generate: sample from the DDPM in latent space, then decode with the VAE decoder. Compare the results to a DDPM trained directly on the data. What are the tradeoffs?
Exercise 12.24 (****)
Theoretical analysis of mode collapse in GANs.
(a) Consider a 1D data distribution that is a mixture of two point masses: $p_{\text{data}} = 0.5 \delta(x - 1) + 0.5 \delta(x + 1)$. The generator is $G_\theta(z) = \theta$ (a constant, ignoring the noise input). Show that the optimal discriminator for this generator assigns $D^*(1) = \frac{0.5}{0.5 + 0} = 1$ (assuming $\theta \neq 1$). What happens when $\theta$ oscillates between $+1$ and $-1$?
(b) Now let $G_\theta(z) = \theta_1 z + \theta_2$ with $z \sim \mathcal{N}(0, 1)$. The generator can produce a Gaussian with mean $\theta_2$ and standard deviation $|\theta_1|$. Can this generator represent the true data distribution? Why does the minimax game nevertheless not converge to it?
(c) Discuss how the Wasserstein distance resolves (or fails to resolve) this problem. Under what conditions does the WGAN objective have a unique equilibrium?
Exercise 12.25 (****)
Privacy guarantees for synthetic data. In the MediCore application (Case Study 1), synthetic patient records must not leak information about real patients.
(a) Define the nearest-neighbor distance ratio (NNDR): for each synthetic record $\mathbf{x}_s$, compute the ratio of its distance to the nearest real record to the distance between the two nearest real records in its neighborhood. Implement this metric.
(b) A synthetic record is a "privacy risk" if its NNDR < 0.5 (it is closer to a real record than real records are to each other). Train VAEs with different latent dimensions ($d \in \{4, 8, 16, 32, 64\}$) on the item features from Section 12.8 and compute the fraction of privacy-risk synthetic records for each. How does latent dimension affect privacy?
(c) Can a generative model provide formal differential privacy guarantees? Discuss the relationship between DP-SGD (differentially private stochastic gradient descent) and the privacy properties of a model's outputs. What is the cost in terms of sample quality?
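Hint for part (a): the NNDR definition leaves some room for interpretation; one reasonable reading is sketched below (pairwise-distance matrices are fine at this data scale; use a KD-tree for larger sets):

```python
import numpy as np

def nndr(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    """For each synthetic record: (distance to its nearest real record)
    / (distance from that real record to its own nearest real neighbor).
    Values well below 1 mean the synthetic point hugs a real record
    more closely than real records hug each other."""
    # Synthetic -> real distances
    d_sr = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    nearest = np.argmin(d_sr, axis=1)
    d1 = d_sr[np.arange(synthetic.shape[0]), nearest]
    # Real -> real distances, excluding self-distance
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    np.fill_diagonal(d_rr, np.inf)
    d2 = d_rr[nearest].min(axis=1)
    return d1 / d2
```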
Exercise 12.26 (**)
Interpolation quality. Using the trained Item VAE from Section 12.8:
(a) Select 5 pairs of items from different categories. For each pair, encode both items to get $\mathbf{z}_a$ and $\mathbf{z}_b$, then decode 9 points along the linear interpolation $\mathbf{z}_\alpha = (1 - \alpha) \mathbf{z}_a + \alpha \mathbf{z}_b$ for $\alpha \in \{0, 0.125, 0.25, \ldots, 1.0\}$.
(b) For each interpolated point, determine the predicted category (argmax of the first 20 decoded dimensions). At what $\alpha$ value does the category switch? Is the transition sharp or gradual?
(c) Compare linear interpolation in the latent space to spherical linear interpolation (slerp): $\mathbf{z}_\alpha = \frac{\sin((1-\alpha)\omega)}{\sin \omega} \mathbf{z}_a + \frac{\sin(\alpha \omega)}{\sin \omega} \mathbf{z}_b$ where $\omega = \arccos(\hat{\mathbf{z}}_a \cdot \hat{\mathbf{z}}_b)$. Does slerp produce smoother transitions?
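Hint for part (c): a slerp implementation needs a guard for near-parallel endpoints, where $\sin \omega \to 0$ (the fallback to linear interpolation is our addition, not part of the formula):

```python
import numpy as np

def slerp(z_a: np.ndarray, z_b: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical linear interpolation; omega is the angle between the
    normalized endpoints, as in part (c)."""
    z_a_hat = z_a / np.linalg.norm(z_a)
    z_b_hat = z_b / np.linalg.norm(z_b)
    omega = np.arccos(np.clip(np.dot(z_a_hat, z_b_hat), -1.0, 1.0))
    if np.isclose(omega, 0.0):  # nearly parallel: fall back to lerp
        return (1 - alpha) * z_a + alpha * z_b
    return (np.sin((1 - alpha) * omega) * z_a
            + np.sin(alpha * omega) * z_b) / np.sin(omega)
```

At $\alpha = 0$ and $\alpha = 1$ the endpoints are recovered exactly, which is a quick way to unit-test the implementation.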
Exercise 12.27 (***)
Conditional generation for data augmentation. Suppose 5% of StreamRec items belong to a rare category (category 19).
(a) Train the Item VAE from Section 12.8 on the full dataset. Generate 5000 synthetic items by sampling $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and decoding. What fraction belong to category 19? Is this close to 5%?
(b) To oversample the rare class, encode all category-19 items, compute the mean and covariance of their latent codes, and sample new latent codes from $\mathcal{N}(\boldsymbol{\mu}_{19}, \boldsymbol{\Sigma}_{19})$. Decode these to generate synthetic category-19 items. Evaluate their quality (do they look like real category-19 items in terms of numeric features?).
(c) Train a category classifier on: (i) the original imbalanced dataset, (ii) the original dataset plus 2000 synthetic category-19 items from part (b). Compare the category-19 precision, recall, and F1 scores. Does data augmentation help?
Exercise 12.28 (****)
Connecting generative models to information theory (links to Chapter 4).
(a) The ELBO can be written as $\text{ELBO} = H(q_\phi(\mathbf{z} \mid \mathbf{x})) + \mathbb{E}_{q_\phi}[\log p(\mathbf{z})] + \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} \mid \mathbf{z})]$. Identify each term as a quantity from information theory (entropy, cross-entropy, or mutual information).
(b) The rate of a VAE is $R = D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$ — the number of nats transmitted through the latent bottleneck. The distortion is $D = -\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} \mid \mathbf{z})]$ — the reconstruction error. The ELBO traces a rate-distortion curve as $\beta$ varies. Plot the $(R, D)$ curve for $\beta \in \{0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0\}$ on the 5-Gaussian mixture.
(c) What is the theoretical minimum distortion at infinite rate? What is the theoretical minimum rate at infinite distortion? How do these extremes correspond to a deterministic autoencoder and a model that ignores the input?