Chapter 16: Quiz

Test your understanding of Autoencoders and Representation Learning. Each question has one correct answer unless stated otherwise.


Question 1

What is the primary purpose of the bottleneck in an undercomplete autoencoder?

  • A) To speed up training by reducing computation
  • B) To force the network to learn a compressed representation that captures essential data structure
  • C) To prevent the decoder from overfitting
  • D) To ensure the latent space is Gaussian-distributed
Answer **B)** The bottleneck (latent dimension smaller than input dimension) forces the encoder to compress the input, discarding irrelevant information and retaining only the most salient features. Without this constraint, the autoencoder could trivially learn the identity function.
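
For concreteness, here is a minimal PyTorch sketch of an undercomplete autoencoder; the 784-dimensional input and the layer sizes are illustrative assumptions, not prescribed by the chapter. The 32-dimensional latent layer is the bottleneck that prevents the network from simply copying its input.

```python
import torch.nn as nn

class UndercompleteAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):  # 32 << 784: the bottleneck
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),                 # compressed representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```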

Question 2

A linear autoencoder with MSE loss learns the same subspace as which classical algorithm?

  • A) K-means clustering
  • B) Linear Discriminant Analysis (LDA)
  • C) Principal Component Analysis (PCA)
  • D) Independent Component Analysis (ICA)
Answer **C)** A linear autoencoder with MSE loss learns to project data onto the subspace spanned by the top-$k$ principal components. The learned basis may differ from PCA by a rotation within the subspace, but the subspace itself is identical.

Question 3

In a sparse autoencoder with KL divergence sparsity, what does the target sparsity parameter $\rho$ represent?

  • A) The fraction of weights that should be zero
  • B) The desired average activation of each latent unit across the training batch
  • C) The learning rate for the sparsity penalty
  • D) The probability of dropping out a latent unit
Answer **B)** The target sparsity $\rho$ (typically small, like 0.05) is the desired average activation of each latent unit. The KL divergence penalty pushes the actual average activation $\hat{\rho}_j$ toward $\rho$, ensuring most units are inactive for any given input.
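
A sketch of the penalty, assuming sigmoid latent activations so that each $\hat{\rho}_j$ lies in $(0, 1)$; the function and variable names are illustrative.

```python
import torch

def kl_sparsity_penalty(latent_activations, rho=0.05, eps=1e-8):
    # latent_activations: (batch, latent_dim), values in (0, 1) from a sigmoid
    rho_hat = latent_activations.mean(dim=0)  # average activation of each unit
    kl = rho * torch.log(rho / (rho_hat + eps)) + \
         (1 - rho) * torch.log((1 - rho) / (1 - rho_hat + eps))
    return kl.sum()  # added to the reconstruction loss, scaled by a sparsity weight
```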

Question 4

What distinguishes a denoising autoencoder from a standard autoencoder?

  • A) It uses a different optimizer
  • B) The input is corrupted and the model is trained to reconstruct the clean original
  • C) It has more layers than a standard autoencoder
  • D) It uses a discrete latent space
Answer **B)** A denoising autoencoder corrupts the input (via Gaussian noise, masking, etc.) and trains the network to reconstruct the clean, uncorrupted original. The loss compares the output to the clean input, not the corrupted one. This forces the model to learn robust features rather than trivial identity mappings.
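
A sketch of one denoising training step with additive Gaussian corruption; `model`, `optimizer`, and the noise level are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def denoising_step(model, optimizer, x_clean, noise_std=0.3):
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)  # corrupt the input
    x_recon = model(x_noisy)                                   # reconstruct from corruption
    loss = F.mse_loss(x_recon, x_clean)                        # compare to the CLEAN input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```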

Question 5

In the VAE framework, what does the encoder network output?

  • A) A single deterministic latent vector
  • B) The parameters of a probability distribution (mean and log-variance) over the latent space
  • C) A reconstructed version of the input
  • D) The loss value for the current sample
Answer **B)** The VAE encoder outputs the parameters of the approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$, specifically the mean $\boldsymbol{\mu}$ and log-variance $\log \boldsymbol{\sigma}^2$ of a Gaussian distribution. This probabilistic output is what distinguishes VAEs from deterministic autoencoders.
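
A sketch of an encoder with the two output heads; the layer sizes are illustrative assumptions.

```python
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_logvar(h)  # distribution parameters, not a sample
```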

Question 6

The ELBO (Evidence Lower Bound) consists of two terms. What are they?

  • A) Reconstruction loss and L2 regularization
  • B) Reconstruction term and KL divergence between approximate posterior and prior
  • C) Encoder loss and decoder loss
  • D) Mean squared error and binary cross-entropy
Answer **B)** The ELBO is: $\text{ELBO} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$. The first term encourages good reconstruction; the second regularizes the approximate posterior to stay close to the prior.
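
A sketch of the negative ELBO as a training loss, assuming a diagonal Gaussian posterior (closed-form KL) and a decoder with sigmoid outputs scored by binary cross-entropy; these modeling choices are illustrative.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x_recon, x, mu, logvar):
    # Reconstruction term: E_q[log p(x|z)], approximated with a single sample of z
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```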

Question 7

Why is the reparameterization trick necessary in VAE training?

  • A) It reduces the variance of gradient estimates
  • B) It allows gradients to flow through the stochastic sampling step by expressing samples as a deterministic function of learnable parameters and external noise
  • C) It converts continuous latent variables to discrete ones
  • D) It replaces backpropagation with a more efficient algorithm
Answer **B)** Sampling $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$ is not differentiable with respect to $\phi$. The reparameterization trick writes $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, making $\mathbf{z}$ a differentiable function of the encoder outputs $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$.
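
A sketch of the reparameterized sampling step; gradients reach `mu` and `logvar` because the randomness enters only through the external noise `eps`.

```python
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)  # sigma = exp(log sigma^2 / 2)
    eps = torch.randn_like(std)    # external noise, carries no gradient
    return mu + std * eps          # differentiable w.r.t. mu and logvar
```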

Question 8

What is posterior collapse in a VAE?

  • A) The decoder produces identical outputs for all inputs
  • B) The approximate posterior matches the prior for all inputs, meaning the encoder ignores the input entirely
  • C) The training loss diverges to infinity
  • D) The latent space becomes too high-dimensional
Answer **B)** Posterior collapse occurs when $q_\phi(\mathbf{z}|\mathbf{x}) \approx p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ for all $\mathbf{x}$. The KL term becomes zero, and the decoder generates from pure noise. This typically happens when the decoder is too powerful or the KL term dominates early in training.
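
One common diagnostic in practice (an assumption about usage, not prescribed by the chapter) is to monitor the per-dimension KL; values near zero for every dimension signal collapse.

```python
import torch

def kl_per_dimension(mu, logvar):
    # KL(q(z|x) || N(0, I)) for each latent dimension, averaged over the batch
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
    return kl.mean(dim=0)  # near-zero in every dimension => posterior collapse
```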

Question 9

What is KL annealing in VAE training?

  • A) Reducing the learning rate over time
  • B) Gradually increasing the weight of the KL divergence term from 0 to 1 during training
  • C) Decreasing the latent dimension over training
  • D) Annealing the temperature in the softmax output
Answer **B)** KL annealing multiplies the KL term by a coefficient $\beta$ that increases from 0 to 1 over a warm-up period. This allows the model to first learn good reconstructions (while $\beta \approx 0$) before the regularization pressure of the KL term kicks in, helping to prevent posterior collapse.
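
A sketch of a linear warm-up schedule; the warm-up length is an illustrative assumption.

```python
def kl_weight(step, warmup_steps=10_000):
    # Linear warm-up: beta rises from 0 to 1, then stays at 1
    return min(1.0, step / warmup_steps)

# In the training loop (sketch):
#   loss = recon_loss + kl_weight(global_step) * kl_loss
```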

Question 10

In a $\beta$-VAE with $\beta > 1$, what is the effect on learned representations?

  • A) Higher reconstruction quality
  • B) Faster training convergence
  • C) More disentangled latent representations at the cost of reconstruction quality
  • D) Lower memory usage
Answer **C)** Setting $\beta > 1$ increases the pressure on the approximate posterior to match the prior (a factorized Gaussian). This encourages each latent dimension to be independent, leading to disentangled representations where individual dimensions capture single factors of variation. The trade-off is reduced reconstruction quality due to the stronger bottleneck.

Question 11

Which of the following is NOT a common corruption strategy for denoising autoencoders?

  • A) Additive Gaussian noise
  • B) Masking noise (setting random dimensions to zero)
  • C) Adversarial perturbations from a discriminator network
  • D) Salt-and-pepper noise
Answer **C)** Adversarial perturbations from a discriminator are used in GANs (Chapter 17), not in standard denoising autoencoders. The three common corruption strategies for DAEs are additive Gaussian noise, masking noise (dropout-like), and salt-and-pepper noise.
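
Sketches of the masking and salt-and-pepper corruptions; the corruption rates are illustrative assumptions.

```python
import torch

def masking_noise(x, p=0.3):
    # Set a random fraction p of the input dimensions to zero
    mask = (torch.rand_like(x) > p).float()
    return x * mask

def salt_and_pepper(x, p=0.1):
    # Set a random fraction p of the dimensions to 0 or 1 with equal probability
    noise = torch.rand_like(x)
    x = x.clone()
    x[noise < p / 2] = 0.0
    x[(noise >= p / 2) & (noise < p)] = 1.0
    return x
```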

Question 12

In SimCLR, what is the role of the projection head?

  • A) It generates data augmentations
  • B) It maps encoder representations to a space where the contrastive loss is computed, and is discarded after pretraining
  • C) It classifies the input into categories
  • D) It reconstructs the original image from the augmented version
Answer **B)** The projection head is a small MLP that maps the encoder's output to the space where the NT-Xent contrastive loss is applied. After pretraining, the projection head is discarded and only the encoder is used for downstream tasks: optimizing the contrastive objective causes the head to throw away information that downstream tasks may still need, whereas the representation before the head retains it.
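
A sketch of a typical two-layer projection head; the dimensions are illustrative assumptions.

```python
import torch.nn as nn

projection_head = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(),  # maps encoder features (e.g., a 2048-d vector)
    nn.Linear(512, 128),              # to the space where NT-Xent is computed
)
# After pretraining: keep the encoder, discard projection_head.
```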

Question 13

What does the temperature parameter $\tau$ control in the NT-Xent loss?

  • A) The learning rate
  • B) The amount of data augmentation
  • C) How sharply the similarity distribution is peaked, affecting sensitivity to hard negatives
  • D) The latent space dimension
Answer **C)** Lower temperature $\tau$ produces a sharper distribution over similarities, making the model more sensitive to the hardest negatives (samples most similar to the anchor but from different images). Higher $\tau$ produces a more uniform distribution. Typical values are 0.07--0.5.
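
A sketch of the NT-Xent loss showing where $\tau$ enters; the normalization and masking details follow common practice and are not prescribed by the chapter.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    # z1, z2: (N, d) projections of two augmented views of the same N images
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.T / tau                                  # cosine similarities / temperature
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))           # exclude self-similarity
    # the positive of row i is row i+n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```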

Question 14

How does BYOL avoid the need for negative pairs?

  • A) By using a very large batch size
  • B) By using an asymmetric architecture with a predictor network and an exponential moving average target network
  • C) By using a reconstruction loss instead of a contrastive loss
  • D) By training only the encoder, not the decoder
Answer **B)** BYOL uses an online network (with a predictor) and a target network (without a predictor, updated via EMA). The online network predicts the target network's representation. The asymmetry (predictor + stop-gradient on target) prevents collapse to a trivial constant solution, eliminating the need for explicit negative pairs.
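
A sketch of the exponential-moving-average update for the target network; the momentum value is an illustrative assumption.

```python
import torch

@torch.no_grad()
def ema_update(online_net, target_net, momentum=0.996):
    # Target parameters slowly track the online network; no gradients flow into them
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.mul_(momentum).add_(p_online, alpha=1 - momentum)
```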

Question 15

In contrastive learning, what is a "positive pair"?

  • A) Two different images from the same class
  • B) Two differently augmented views of the same image
  • C) An image and its class label
  • D) An image and its reconstruction
Answer **B)** A positive pair consists of two augmented views of the same image (e.g., different random crops with different color jittering). The contrastive objective trains the encoder to produce similar representations for these views. Note that class labels are not used; the pairing is based on the augmentation of the same image, not class membership.

Question 16

What is "linear probing" in the context of evaluating self-supervised representations?

  • A) Training a linear regression model on raw pixels
  • B) Freezing the pretrained encoder and training only a linear classifier on top of the learned representations
  • C) Probing the model's weights with linear algebra techniques
  • D) Gradually unfreezing layers during fine-tuning
Answer **B)** Linear probing freezes the pretrained encoder and adds a single linear layer (fully connected + softmax) that is trained on the downstream task with labels. High linear probe accuracy demonstrates that the frozen representations are already linearly separable, indicating high-quality feature learning.
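
A sketch of the linear-probe setup; the feature dimension and class count are illustrative assumptions.

```python
import torch.nn as nn

def build_linear_probe(encoder, feature_dim=2048, num_classes=10):
    for p in encoder.parameters():
        p.requires_grad = False               # freeze the pretrained encoder
    encoder.eval()
    probe = nn.Linear(feature_dim, num_classes)  # the only trainable module
    return probe

# Training (sketch): logits = probe(encoder(x)); loss = F.cross_entropy(logits, y)
```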

Question 17

For a VAE with a 2D latent space trained on MNIST, what would you expect to see when decoding a grid of points in latent space?

  • A) Random noise with no structure
  • B) A smooth manifold showing gradual transitions between digit types
  • C) Exact copies of training images
  • D) Only a single digit type repeated everywhere
Answer **B)** The KL regularization in the VAE ensures the latent space is smooth and structured. Decoding a grid of points reveals a generative manifold where nearby points decode to similar digits, with gradual morphing between digit types as you move across the grid.
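
A sketch of how such a grid could be decoded, assuming a trained `decoder` that maps 2D latent vectors to flattened 28x28 images.

```python
import torch

@torch.no_grad()
def decode_latent_grid(decoder, grid_size=15, span=3.0):
    # Regular grid of points in the 2D latent space, decoded one image per point
    zs = torch.linspace(-span, span, grid_size)
    grid = torch.cartesian_prod(zs, zs)        # (grid_size**2, 2)
    images = decoder(grid)                     # (grid_size**2, 784), per the assumption above
    return images.view(grid_size, grid_size, 28, 28)
```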

Question 18

Why do VAEs tend to produce blurry reconstructions compared to deterministic autoencoders?

  • A) They use fewer parameters
  • B) The Gaussian decoder assumption and MSE loss encourage the model to hedge its bets, producing averages over possible outputs
  • C) They train for fewer epochs
  • D) The latent space is too large
Answer **B)** With a Gaussian decoder $p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}; \hat{\mathbf{x}}, \sigma^2\mathbf{I})$, the optimal output for a given $\mathbf{z}$ is the mean of all training images that could have produced that $\mathbf{z}$. This averaging effect produces blurry reconstructions, especially in regions of latent space where multiple modes overlap.

Question 19

What is the key difference between a Conditional VAE (CVAE) and a standard VAE?

  • A) The CVAE uses a different loss function
  • B) The CVAE conditions both the encoder and decoder on additional information (e.g., class labels)
  • C) The CVAE has no latent space
  • D) The CVAE does not use the reparameterization trick
Answer **B)** A CVAE conditions both the encoder $q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{c})$ and the decoder $p_\theta(\mathbf{x}|\mathbf{z}, \mathbf{c})$ on additional information $\mathbf{c}$, such as a class label. This enables controlled generation: you can specify what type of output to generate by providing the appropriate condition.
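
A sketch of the conditioning mechanism, assuming the condition is a class label that is one-hot encoded and concatenated to the network inputs.

```python
import torch
import torch.nn.functional as F

def condition(x, labels, num_classes=10):
    # Append a one-hot class vector to each input; labels must be integer class indices
    c = F.one_hot(labels, num_classes).float()
    return torch.cat([x, c], dim=1)

# Encoder sees condition(x, y); decoder sees torch.cat([z, F.one_hot(y, 10).float()], dim=1)
```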

Question 20

In the VQ-VAE, what replaces the continuous latent space of the standard VAE?

  • A) A binary latent space
  • B) A discrete codebook of learned embedding vectors
  • C) A hierarchical latent space
  • D) A latent space with uniform prior
Answer **B)** The VQ-VAE replaces the continuous Gaussian latent space with a discrete codebook. The encoder output is mapped to the nearest codebook vector using vector quantization. This discrete bottleneck avoids the blurriness of continuous VAEs and enables high-quality reconstruction.
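
A sketch of the quantization step (the straight-through gradient estimator and the codebook/commitment losses are omitted); the shapes are illustrative.

```python
import torch

def quantize(z_e, codebook):
    # z_e: (batch, d) encoder outputs; codebook: (K, d) learned embedding vectors
    distances = torch.cdist(z_e, codebook)  # (batch, K) pairwise L2 distances
    indices = distances.argmin(dim=1)       # index of the nearest codebook entry
    return codebook[indices], indices       # discrete latents z_q and their codes
```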

Question 21

Which self-supervised pretext task involves masking random patches of an image and training the model to reconstruct them?

  • A) Contrastive learning
  • B) Rotation prediction
  • C) Masked Image Modeling (MAE)
  • D) Colorization
Answer **C)** Masked Autoencoders (MAE) mask random patches of an input image and train the model to reconstruct the masked regions. This is the visual analog of BERT's masked language modeling and has been shown to learn excellent visual representations.
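
A sketch of random patch masking, assuming the image has already been split into a `(batch, num_patches, patch_dim)` tensor of patches; the 75% mask ratio follows MAE's default.

```python
import torch

def random_mask_patches(patches, mask_ratio=0.75):
    # Keep a random subset of patches; the rest are "masked" and must be reconstructed
    batch, n, dim = patches.shape
    keep = int(n * (1 - mask_ratio))
    idx = torch.rand(batch, n, device=patches.device).argsort(dim=1)[:, :keep]
    visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, dim))
    return visible, idx  # visible patches go to the encoder; idx records which were kept
```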

Question 22

Why is the choice of data augmentation critical in contrastive learning?

  • A) Augmentations determine the batch size
  • B) Augmentations define what invariances the model learns; too weak allows shortcuts, too strong destroys semantic content
  • C) Augmentations control the learning rate
  • D) Augmentations are only used during evaluation
Answer **B)** The augmentations define what the model considers "the same." If augmentations are too weak (e.g., only tiny translations), the model can use low-level shortcuts to match views. If too strong (e.g., extreme distortion), the two views may no longer share semantic content. The right augmentations force the model to learn high-level semantic features.

Question 23

An autoencoder trained only on normal data is used for anomaly detection. How are anomalies identified?

  • A) By checking if the latent code is near zero
  • B) By measuring reconstruction error; anomalous inputs that differ from training data will have high reconstruction error
  • C) By clustering the latent codes and finding small clusters
  • D) By checking the gradient magnitude during a forward pass
Answer **B)** The autoencoder learns to reconstruct normal data well. When presented with anomalous data that differs from the training distribution, the reconstruction will be poor, resulting in high reconstruction error. Thresholding this error provides an anomaly score.
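
A sketch of reconstruction-error scoring; the threshold would typically be chosen on held-out normal data (an assumption about practice).

```python
import torch

@torch.no_grad()
def anomaly_scores(model, x):
    recon = model(x)
    # Per-sample mean squared reconstruction error serves as the anomaly score
    return ((x - recon) ** 2).flatten(1).mean(dim=1)

# is_anomaly = anomaly_scores(model, x) > threshold
```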

Question 24

What is the relationship between the ELBO and the true log-evidence $\log p_\theta(\mathbf{x})$?

  • A) ELBO = $\log p_\theta(\mathbf{x})$ always
  • B) ELBO $\leq \log p_\theta(\mathbf{x})$, with the gap equal to $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$
  • C) ELBO $\geq \log p_\theta(\mathbf{x})$
  • D) ELBO and $\log p_\theta(\mathbf{x})$ are unrelated
Answer **B)** The identity $\log p_\theta(\mathbf{x}) = \text{ELBO} + D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$ shows that the ELBO is always a lower bound. The gap is the KL divergence between the approximate and true posteriors. The bound is tight when $q_\phi$ perfectly matches $p_\theta(\mathbf{z}|\mathbf{x})$.

Question 25

Which statement about the progression of representation learning methods is most accurate?

  • A) Reconstruction-based methods have completely replaced contrastive methods
  • B) Contrastive and self-supervised methods generally produce better representations for downstream tasks than reconstruction-based autoencoders
  • C) All representation learning methods produce identical features
  • D) Supervised pretraining is always superior to self-supervised pretraining
Answer **B)** Modern contrastive and self-supervised methods (SimCLR, BYOL, DINO, MAE) generally learn representations that outperform those from reconstruction-based autoencoders on downstream tasks like classification and detection. This is because contrastive objectives directly optimize for semantic similarity, while reconstruction objectives waste capacity on pixel-level details. However, VAEs remain valuable when generative capability or structured latent spaces are needed.