Chapter 16: Exercises

Part A: Conceptual Foundations

Exercise 16.1: Autoencoder vs. PCA

A linear autoencoder with encoder $\mathbf{z} = \mathbf{W}_e \mathbf{x}$ and decoder $\hat{\mathbf{x}} = \mathbf{W}_d \mathbf{z}$ is trained with MSE loss. Assume $\mathbf{x} \in \mathbb{R}^{100}$ and $\mathbf{z} \in \mathbb{R}^{10}$.

a) Show that at the optimum, the rows of $\mathbf{W}_e$ (equivalently, the columns of $\mathbf{W}_d$) span the same subspace as the top 10 principal components of the data. b) Explain why the learned basis vectors may differ from the principal components by a rotation. c) What happens if we add ReLU activations to the encoder? Does the equivalence to PCA still hold?

Exercise 16.2: Bottleneck Capacity

An undercomplete autoencoder for $28 \times 28$ grayscale images uses latent dimensions $k \in \{2, 8, 32, 128, 512\}$.

a) Calculate the compression ratio for each value of $k$. b) For which values of $k$ would you expect near-perfect reconstruction on MNIST? Why? c) At what point does the autoencoder risk learning the identity function? What factors besides $k$ determine this risk?

Exercise 16.3: Sparse Autoencoder Analysis

A sparse autoencoder uses L1 regularization with $\lambda = 0.001$ and has a latent dimension of $k = 256$ (overcomplete for a 784-dimensional input).

a) Why doesn't an overcomplete autoencoder trivially learn the identity function when L1 sparsity is applied? b) If the average sparsity (fraction of nonzero activations per input) is 15%, how many latent dimensions are active on average? c) Compare the L1 penalty to the KL divergence sparsity penalty. What are the trade-offs?

Exercise 16.4: Denoising Autoencoder Theory

A denoising autoencoder is trained with masking noise (probability $p = 0.5$ of zeroing each input dimension).

a) Explain intuitively why denoising forces the autoencoder to learn correlations between input dimensions. b) If $p = 0$ (no corruption), what does the DAE reduce to? c) If $p = 1$ (all inputs zeroed), can the DAE learn anything useful? Why or why not? d) Vincent (2011) showed that a DAE trained with small Gaussian noise learns to estimate the score $\nabla_\mathbf{x} \log p(\mathbf{x})$. Why is this useful for generation?

Exercise 16.5: ELBO Derivation

Derive the Evidence Lower Bound starting from the log-marginal likelihood $\log p_\theta(\mathbf{x})$ using the identity:

$$\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} \left[\log p_\theta(\mathbf{x})\right]$$

a) Show all steps of the derivation, identifying where the KL divergence appears. b) Prove that the gap between $\log p_\theta(\mathbf{x})$ and the ELBO equals $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$. c) Under what condition is the ELBO tight (i.e., equals $\log p_\theta(\mathbf{x})$)?

Exercise 16.6: KL Divergence Computation

For $q = \mathcal{N}(\mu, \sigma^2)$ and $p = \mathcal{N}(0, 1)$ (both univariate):

a) Derive the closed-form expression $D_{\text{KL}}(q \| p) = -\frac{1}{2}(1 + \log \sigma^2 - \mu^2 - \sigma^2)$. b) Compute $D_{\text{KL}}$ for $\mu = 0.5, \sigma = 0.8$. c) Compute $D_{\text{KL}}$ for $\mu = 0, \sigma = 1$. Verify it equals zero. d) What happens to $D_{\text{KL}}$ as $\sigma \to 0$? Interpret this geometrically.
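
A short numerical sanity check for parts b)–d) (a sketch in NumPy; the Monte Carlo estimator is included only to verify your hand computation against the closed form, and the sample count is arbitrary):

```python
import numpy as np

def kl_gaussian_closed_form(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for scalar parameters."""
    return -0.5 * (1 + np.log(sigma**2) - mu**2 - sigma**2)

def kl_gaussian_monte_carlo(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate E_q[log q(z) - log p(z)], for checking the derivation."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, sigma, size=n_samples)
    log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2)
    log_p = -0.5 * np.log(2 * np.pi) - z**2 / 2
    return np.mean(log_q - log_p)

# The two estimates should agree to roughly three decimal places.
print(kl_gaussian_closed_form(0.5, 0.8), kl_gaussian_monte_carlo(0.5, 0.8))
```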


Part B: Mathematical Analysis

Exercise 16.7: Reparameterization Trick

Consider the gradient of the reconstruction term with respect to encoder parameters $\phi$:

$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} [\log p_\theta(\mathbf{x}|\mathbf{z})]$$

a) Explain why the naive Monte Carlo estimate $\nabla_\phi \log p_\theta(\mathbf{x}|\mathbf{z}^{(l)})$ with $\mathbf{z}^{(l)} \sim q_\phi(\mathbf{z}|\mathbf{x})$ is incorrect. b) Write the reparameterized form $\mathbf{z} = \mu + \sigma \odot \epsilon$ and show how gradients with respect to $\mu$ and $\sigma$ are computed. c) Why does the reparameterization trick not work for discrete latent variables?
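
For part b), a minimal sketch of the reparameterized sampling step (assuming PyTorch, with the encoder outputting `mu` and `log_var`, a common parameterization rather than one fixed by the exercise):

```python
import torch

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Gradients flow to mu and log_var through the deterministic transform,
    while the randomness is isolated in eps, which carries no parameters.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(log_var / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I), independent of phi
    return mu + std * eps
```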

Exercise 16.8: Beta-VAE Trade-off

A $\beta$-VAE uses the loss $\mathcal{L} = -\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] + \beta \, D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$.

a) What does $\beta = 0$ reduce to? b) What does $\beta \to \infty$ force the model to do? c) Explain intuitively why $\beta > 1$ encourages disentanglement. d) Plot (sketch) the expected reconstruction quality and disentanglement as functions of $\beta$.

Exercise 16.9: Information-Theoretic View

The ELBO can be rewritten as:

$$\text{ELBO} = \mathbb{E}_{p_{\text{data}}(\mathbf{x})}[\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

a) Show that the KL term can be decomposed as: $\mathbb{E}_{p_{\text{data}}}[D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))] = I_q(\mathbf{x}; \mathbf{z}) + D_{\text{KL}}(q_\phi(\mathbf{z}) \| p(\mathbf{z}))$ where $I_q$ is the mutual information under $q$ and $q_\phi(\mathbf{z}) = \mathbb{E}_{p_{\text{data}}}[q_\phi(\mathbf{z}|\mathbf{x})]$ is the aggregate posterior. b) Interpret each term. Which term encourages informative codes? Which encourages matching the prior marginally?

Exercise 16.10: NT-Xent Loss Analysis

In SimCLR, a batch of $N$ images yields $2N$ augmented views. The NT-Xent loss for a positive pair $(i, j)$ among these views is:

$$\ell_{i,j} = -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau)}$$

a) What is the minimum value of this loss? When is it achieved? b) How many negative pairs does each anchor have? Express in terms of $N$. c) What happens as $\tau \to 0$? As $\tau \to \infty$? d) Why does larger batch size improve SimCLR performance?
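
For reference when answering parts a)–d), a minimal sketch of this loss (assuming PyTorch; the convention that rows $i$ and $i+N$ of the embedding matrix form a positive pair, and the default temperature, are illustrative choices):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss. z: (2N, d) embeddings; rows i and i + N form a positive pair."""
    n2 = z.shape[0]
    z = F.normalize(z, dim=1)                             # cosine similarity via dot products
    sim = (z @ z.t()) / tau                               # (2N, 2N) scaled similarity matrix
    mask = torch.eye(n2, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))            # exclude k == i from the denominator
    n = n2 // 2
    targets = torch.cat([torch.arange(n, n2), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)                  # averaged over all 2N anchors
```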

Exercise 16.11: BYOL Stability

BYOL uses an exponential moving average (EMA) update: $\xi \leftarrow m\xi + (1-m)\theta$ with $m = 0.996$.

a) After 100 updates, what fraction of the initial target parameters $\xi_0$ remains? Compute $(0.996)^{100}$. b) Why is the EMA update important for preventing collapse? c) What would happen if $m = 0$ (target network always equals online network)? d) What would happen if $m = 1$ (target network never updates)?
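
A minimal sketch of the EMA step (assuming PyTorch modules `target_net` and `online_net` whose parameters iterate in matching order):

```python
import torch

@torch.no_grad()
def ema_update(target_net, online_net, m=0.996):
    """Exponential moving average update of the target parameters: xi <- m*xi + (1-m)*theta."""
    for xi, theta in zip(target_net.parameters(), online_net.parameters()):
        xi.mul_(m).add_(theta, alpha=1 - m)
```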

Exercise 16.12: Posterior Collapse Diagnosis

You train a VAE and observe that the KL divergence term is nearly zero throughout training.

a) What does this imply about the encoder's output $q_\phi(\mathbf{z}|\mathbf{x})$? b) What does this imply about the generated samples? c) List three techniques to prevent posterior collapse and explain the mechanism of each.


Part C: Coding Exercises

Exercise 16.13: Implement an Undercomplete Autoencoder

Implement a fully connected autoencoder with architecture 784-256-128-32-128-256-784 for MNIST.

a) Train with MSE loss and Adam optimizer for 20 epochs. b) Visualize 10 original and reconstructed images side by side. c) Plot the latent space using t-SNE, colored by digit class. d) Report the final reconstruction loss on the test set.
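
A starting sketch for the architecture (assuming PyTorch; the ReLU activations and sigmoid output layer are choices not fixed by the exercise):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Fully connected autoencoder with the 784-256-128-32-128-256-784 layout."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 32),
        )
        self.decoder = nn.Sequential(
            nn.Linear(32, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),  # pixel intensities scaled to [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)       # 32-dimensional bottleneck code
        return self.decoder(z)
```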

Exercise 16.14: Sparse Autoencoder Comparison

Extend Exercise 16.13 to implement a sparse autoencoder with $k = 256$ (overcomplete).

a) Implement both L1 and KL divergence sparsity penalties. b) Train both variants and compare the sparsity of the learned codes (histogram of activations). c) Measure downstream classification accuracy using a linear classifier on the latent codes. d) Experiment with different values of $\lambda \in \{0.0001, 0.001, 0.01, 0.1\}$ and report the trade-off between reconstruction loss and sparsity.
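
Possible implementations of the two penalties in part a) (a sketch assuming PyTorch; the target sparsity `rho`, the weight `lam`, and the use of sigmoid latent units for the KL variant are assumptions, not requirements):

```python
import torch

def l1_penalty(activations, lam=1e-3):
    """L1 sparsity penalty on the latent activations."""
    return lam * activations.abs().mean()

def kl_sparsity_penalty(activations, rho=0.05, lam=1e-3, eps=1e-8):
    """KL-divergence penalty pushing the mean activation of each unit toward rho.

    Assumes sigmoid latent units so that mean activations lie in (0, 1).
    """
    rho_hat = activations.mean(dim=0).clamp(eps, 1 - eps)   # average activation per unit
    kl = rho * torch.log(rho / rho_hat) \
        + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return lam * kl.sum()
```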

Exercise 16.15: Denoising Autoencoder

Implement a denoising autoencoder that handles three corruption types: Gaussian noise, masking noise, and salt-and-pepper noise.

a) Train separate models for each corruption type on MNIST. b) Visualize: original, corrupted, and reconstructed images for each. c) Evaluate each model on data corrupted with a different noise type than it was trained on. What do you observe about generalization?
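
Sketches of the three corruption functions (assuming PyTorch tensors with pixel values in $[0, 1]$; the noise levels are illustrative):

```python
import torch

def gaussian_noise(x, sigma=0.3):
    """Additive Gaussian corruption, clipped back to the valid pixel range."""
    return (x + sigma * torch.randn_like(x)).clamp(0, 1)

def masking_noise(x, p=0.5):
    """Zero each pixel independently with probability p."""
    return x * (torch.rand_like(x) > p).float()

def salt_and_pepper(x, p=0.2):
    """Set a random fraction p of pixels to 0 or 1 with equal probability."""
    noisy = x.clone()
    mask = torch.rand_like(x) < p
    noisy[mask] = (torch.rand_like(x)[mask] > 0.5).float()
    return noisy
```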

Exercise 16.16: VAE Implementation

Implement a convolutional VAE for MNIST with a 20-dimensional latent space.

a) Implement the reparameterization trick. b) Implement the ELBO loss (reconstruction + KL divergence). c) Train for 30 epochs and plot the reconstruction loss and KL divergence separately over training. d) Generate 100 random samples by sampling $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and decoding.
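
A sketch of the ELBO loss for part b) (assuming PyTorch and a sigmoid decoder so that binary cross-entropy is a valid reconstruction term; substitute MSE for a Gaussian decoder):

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var):
    """Negative ELBO for one minibatch: reconstruction term plus KL divergence.

    Returns the total loss and the two components so they can be plotted separately.
    """
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl, recon, kl
```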

Exercise 16.17: Latent Space Exploration

Using the VAE from Exercise 16.16 but with a 2D latent space:

a) Encode the test set and plot the latent codes colored by digit class. b) Create a $20 \times 20$ grid of points in $[-3, 3] \times [-3, 3]$, decode each, and display the resulting manifold. c) Pick two test images of different digits. Interpolate between their latent codes (10 steps) and visualize the decoded interpolation. d) Identify which regions of latent space correspond to which digits.
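
A sketch for part b) (assuming PyTorch and a decoder whose per-example output can be reshaped to a $28 \times 28$ image; the grid size and range follow the exercise):

```python
import torch

@torch.no_grad()
def decode_grid(decoder, n=20, lo=-3.0, hi=3.0, device='cpu'):
    """Decode an n x n grid of 2-D latent points into a single image mosaic."""
    coords = torch.linspace(lo, hi, n)
    zs = torch.stack(torch.meshgrid(coords, coords, indexing='ij'), dim=-1).reshape(-1, 2)
    imgs = decoder(zs.to(device)).cpu().reshape(n, n, 28, 28)
    # Stitch into one (n*28, n*28) array suitable for a single imshow call.
    return imgs.permute(0, 2, 1, 3).reshape(n * 28, n * 28)
```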

Exercise 16.18: KL Annealing

Implement KL annealing for the VAE from Exercise 16.16.

a) Implement linear annealing: $\beta$ increases from 0 to 1 over the first 10 epochs. b) Implement cyclical annealing: $\beta$ cycles between 0 and 1 multiple times during training. c) Compare reconstructions and samples from both schedules against a fixed $\beta = 1$ baseline. d) Plot KL divergence vs. epoch for all three approaches. Which avoids posterior collapse best?
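
Possible schedules for parts a) and b) (a sketch; the warm-up length, cycle length, and ramp fraction are illustrative hyperparameters):

```python
def linear_beta(epoch, warmup_epochs=10):
    """Linear annealing: beta ramps from 0 to 1 over the warm-up period, then stays at 1."""
    return min(1.0, epoch / warmup_epochs)

def cyclical_beta(epoch, cycle_len=10, ramp_fraction=0.5):
    """Cyclical annealing: beta ramps from 0 to 1 during the first part of each cycle."""
    pos = (epoch % cycle_len) / cycle_len
    return min(1.0, pos / ramp_fraction)
```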


Part D: Applied Exercises

Exercise 16.19: Anomaly Detection with Autoencoders

Train an autoencoder on only digits 0--4 from MNIST.

a) Compute reconstruction errors for all test digits (0--9). b) Plot the distribution of reconstruction errors for "normal" digits (0--4) vs. "anomalous" digits (5--9). c) Set a threshold at the 95th percentile of normal reconstruction errors. What is the detection rate for anomalous digits? d) Plot the ROC curve and compute the AUC.
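
A sketch of the evaluation in parts c) and d) (assuming NumPy arrays of per-example reconstruction errors and binary anomaly labels, and scikit-learn for the ROC/AUC computation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_anomaly_scores(errors, labels, percentile=95):
    """Threshold reconstruction errors and compute detection rate and AUC.

    errors: per-example reconstruction error; labels: 1 for anomalous digits (5-9), 0 otherwise.
    """
    threshold = np.percentile(errors[labels == 0], percentile)   # 95th percentile of normals
    detection_rate = (errors[labels == 1] > threshold).mean()
    auc = roc_auc_score(labels, errors)                          # higher error => more anomalous
    fpr, tpr, _ = roc_curve(labels, errors)
    return threshold, detection_rate, auc, (fpr, tpr)
```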

Exercise 16.20: Image Denoising Pipeline

Build a practical image denoising pipeline using a convolutional denoising autoencoder.

a) Train on Fashion-MNIST with additive Gaussian noise ($\sigma = 0.3$). b) Evaluate denoising quality using PSNR and SSIM metrics. c) Compare against a simple Gaussian blur baseline. d) Test generalization: train on one noise level, evaluate on different noise levels.

Exercise 16.21: Conditional VAE

Implement a Conditional VAE that conditions on digit class for MNIST.

a) Modify the encoder to take concatenated input $[\mathbf{x}, \mathbf{c}]$ where $\mathbf{c}$ is a one-hot class vector. b) Modify the decoder to take concatenated input $[\mathbf{z}, \mathbf{c}]$. c) Generate digits conditioned on each class (0--9). Show a $10 \times 10$ grid. d) Fix $\mathbf{z}$ and vary $\mathbf{c}$. What does this reveal about what $\mathbf{z}$ encodes vs. what $\mathbf{c}$ encodes?
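
A sketch of the conditional forward pass for parts a) and b) (assuming PyTorch; the `encoder`, `decoder`, and `reparameterize` interfaces named here are assumptions stated in the docstring, not fixed by the exercise):

```python
import torch
import torch.nn.functional as F

def cvae_forward(encoder, decoder, reparameterize, x, y, num_classes=10):
    """One forward pass of a conditional VAE.

    x: flattened images of shape (batch, 784); y: integer class labels of shape (batch,).
    Assumes encoder(inp) returns (mu, log_var) and decoder accepts (batch, latent_dim + 10).
    """
    c = F.one_hot(y, num_classes).float()             # (batch, 10) one-hot condition
    mu, log_var = encoder(torch.cat([x, c], dim=1))   # condition the encoder on the label
    z = reparameterize(mu, log_var)
    x_hat = decoder(torch.cat([z, c], dim=1))         # condition the decoder on the label
    return x_hat, mu, log_var
```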

Exercise 16.22: Representation Quality Evaluation

Compare representations learned by four methods on CIFAR-10: undercomplete autoencoder, VAE, SimCLR (simplified), and random initialization.

a) Train each method (simplified architectures are fine). b) Evaluate using linear probing: freeze the encoder, train a linear classifier on the representations. c) Evaluate using k-NN (k=5) classification on the representations. d) Report and compare accuracies. Which method produces the best representations?
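
A sketch of the evaluation protocol for parts b) and c) (assuming PyTorch encoders that return a flat feature vector per example, and scikit-learn classifiers; logistic regression stands in for the linear probe):

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def extract_features(encoder, loader, device='cpu'):
    """Run the frozen encoder over a dataloader and collect features and labels."""
    encoder.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(encoder(x.to(device)).cpu())
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def probe(encoder, train_loader, test_loader, device='cpu'):
    """Linear probing and k-NN (k=5) accuracy on frozen representations."""
    x_tr, y_tr = extract_features(encoder, train_loader, device)
    x_te, y_te = extract_features(encoder, test_loader, device)
    linear_acc = LogisticRegression(max_iter=1000).fit(x_tr, y_tr).score(x_te, y_te)
    knn_acc = KNeighborsClassifier(n_neighbors=5).fit(x_tr, y_tr).score(x_te, y_te)
    return linear_acc, knn_acc
```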


Part E: Research and Extension

Exercise 16.23: VQ-VAE Implementation

Implement a Vector Quantized VAE with a codebook of size 512.

a) Implement the vector quantization layer with straight-through estimator for gradients. b) Implement the commitment loss and codebook loss. c) Train on MNIST and visualize the codebook usage (how many codebook entries are actively used). d) Compare reconstruction quality with the standard VAE from Exercise 16.16.
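
A sketch of the quantization layer for parts a) and b) (assuming PyTorch; the code dimension, codebook initialization, and commitment weight $\beta = 0.25$ are common but not mandated choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Vector quantization layer with a straight-through gradient estimator."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta  # commitment loss weight

    def forward(self, z_e):
        # z_e: (batch, code_dim) continuous encoder outputs.
        dists = torch.cdist(z_e, self.codebook.weight)                  # (batch, num_codes)
        indices = dists.argmin(dim=1)                                   # nearest codebook entry
        z_q = self.codebook(indices)
        codebook_loss = F.mse_loss(z_q, z_e.detach())                   # moves codes toward encoder outputs
        commitment_loss = self.beta * F.mse_loss(z_e, z_q.detach())     # keeps encoder near chosen codes
        z_q = z_e + (z_q - z_e).detach()                                # straight-through estimator
        return z_q, codebook_loss + commitment_loss, indices
```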

Exercise 16.24: Disentanglement with Beta-VAE

Train $\beta$-VAE models on MNIST with $\beta \in \{0.1, 1, 4, 10, 50\}$.

a) For each $\beta$, encode the test set and compute the variance of each latent dimension. b) For the $\beta = 4$ model, vary one latent dimension at a time (keeping others fixed) and decode. Do individual dimensions control interpretable factors? c) Propose and implement a simple disentanglement metric based on the latent traversals.

Exercise 16.25: Data Augmentation Strategy for SimCLR

Study the effect of augmentation strategy on SimCLR performance.

a) Implement SimCLR with these augmentation combinations: (i) crop only, (ii) crop + flip, (iii) crop + flip + color jitter, (iv) crop + flip + color jitter + blur. b) Train each variant on CIFAR-10 (simplified backbone) for the same number of epochs. c) Evaluate using linear probing. d) Which augmentations contribute most to performance? Does this match the findings from the original SimCLR paper?

Exercise 16.26: Wasserstein Autoencoder

The Wasserstein Autoencoder (WAE) replaces the KL divergence in the VAE with the Maximum Mean Discrepancy (MMD):

$$\text{MMD}^2(q_\phi(\mathbf{z}), p(\mathbf{z})) = \mathbb{E}[k(\mathbf{z}, \mathbf{z}')] - 2\mathbb{E}[k(\mathbf{z}, \mathbf{z}^*)] + \mathbb{E}[k(\mathbf{z}^*, \mathbf{z}^{*'})]$$

where $k$ is a kernel function, $\mathbf{z}, \mathbf{z}' \sim q_\phi$ and $\mathbf{z}^*, \mathbf{z}^{*'} \sim p$.

a) Implement the MMD penalty with a Gaussian RBF kernel. b) Train a WAE on MNIST. c) Compare the latent space structure and sample quality with the VAE.
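
A sketch of the penalty for part a) (assuming PyTorch; this is the biased V-statistic estimator that keeps the diagonal kernel terms, and the bandwidth `sigma` is an illustrative choice):

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between two sets of latent samples."""
    sq_dists = torch.cdist(a, b).pow(2)
    return torch.exp(-sq_dists / (2 * sigma**2))

def mmd_penalty(z_q, z_p, sigma=1.0):
    """Squared MMD between encoder samples z_q ~ q_phi(z) and prior samples z_p ~ p(z)."""
    k_qq = rbf_kernel(z_q, z_q, sigma)
    k_pp = rbf_kernel(z_p, z_p, sigma)
    k_qp = rbf_kernel(z_q, z_p, sigma)
    return k_qq.mean() + k_pp.mean() - 2 * k_qp.mean()
```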

Exercise 16.27: Contrastive Learning for Tabular Data

Adapt the contrastive learning framework to tabular data.

a) Design appropriate "augmentations" for tabular data (e.g., feature dropout, Gaussian noise, feature permutation within a column). b) Implement a simplified SimCLR pipeline for the Iris or Wine dataset. c) Evaluate the learned representations using linear probing vs. training directly on the raw features. d) Discuss the challenges of contrastive learning for tabular data vs. images.
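
Sketches of possible augmentations for part a) (assuming PyTorch and standardized feature tensors of shape `(batch, features)`; the corruption rates are illustrative):

```python
import torch

def feature_dropout(x, p=0.2):
    """Randomly zero a fraction p of feature values."""
    return x * (torch.rand_like(x) > p).float()

def gaussian_jitter(x, sigma=0.1):
    """Add small Gaussian noise to standardized features."""
    return x + sigma * torch.randn_like(x)

def column_shuffle(x, p=0.2):
    """For a random subset of columns, replace values with values drawn from other rows."""
    x_aug = x.clone()
    cols = torch.rand(x.shape[1], device=x.device) < p     # which columns to corrupt
    perm = torch.randperm(x.shape[0], device=x.device)     # row permutation
    x_aug[:, cols] = x[perm][:, cols]
    return x_aug
```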

Exercise 16.28: Interpolation Quality Metrics

Develop a quantitative metric for latent space interpolation quality.

a) For a VAE trained on MNIST, interpolate between all pairs of digit classes (0--9). b) For each interpolation, use a pretrained classifier to predict the class of each intermediate decoded image. c) Define "interpolation smoothness" as the fraction of interpolations where the predicted class transitions monotonically (no back-and-forth switching). d) Compare this metric across different latent dimensions ($k \in \{2, 8, 32\}$).
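
One way to compute the metric defined in part c) (a plain-Python sketch; `all_predictions` is assumed to be a list of per-interpolation class-label sequences produced by the pretrained classifier):

```python
def is_monotonic_transition(predictions):
    """True if each predicted class appears in one contiguous run (no back-and-forth)."""
    collapsed = [predictions[0]]
    for p in predictions[1:]:
        if p != collapsed[-1]:
            collapsed.append(p)           # keep only the class-change points
    return len(collapsed) == len(set(collapsed))

def interpolation_smoothness(all_predictions):
    """Fraction of interpolations whose class predictions transition monotonically."""
    return sum(is_monotonic_transition(p) for p in all_predictions) / len(all_predictions)
```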

Exercise 16.29: Multi-Modal Autoencoder

Implement an autoencoder that jointly encodes images and their class labels into a shared latent space.

a) Design an encoder that takes both an image and a label and produces a single latent code. b) Design a decoder that reconstructs both the image and the label from the latent code. c) Train on MNIST and evaluate: can the model reconstruct the label from the image alone (by marginalizing over the label input)? d) How does the shared latent space differ from image-only and label-only latent spaces?

Exercise 16.30: Representation Learning Survey

Write a technical comparison (1--2 pages) of the representation learning methods covered in this chapter.

a) Organize your comparison along these axes: training objective, latent space structure, generative capability, scalability, and downstream task performance. b) For each method, identify one real-world application where it would be the best choice and explain why. c) Discuss the trend from reconstruction-based to contrastive-based objectives. What drove this shift? d) Speculate on the next evolution of self-supervised learning.