Chapter 17: Quiz

Test your understanding of Generative Adversarial Networks. Each question has one correct answer unless stated otherwise.


Question 1

What is the role of the discriminator in a GAN?

  • A) To generate realistic data samples from noise
  • B) To classify whether an input is real or generated (fake)
  • C) To compute the reconstruction loss
  • D) To encode data into a latent representation
Answer **B)** The discriminator takes a data point (either real from the dataset or generated by the generator) and outputs the probability that it is real. It is trained to correctly distinguish real from fake data, while the generator is trained to fool it.

Question 2

What type of game does GAN training implement?

  • A) A cooperative game where both networks minimize the same loss
  • B) A minimax game where the generator minimizes and the discriminator maximizes the same objective
  • C) A zero-sum game where both networks are trained independently
  • D) A sequential game where the discriminator is fully trained before the generator starts
Answer **B)** GANs implement a minimax game: $\min_G \max_D V(D, G)$. The discriminator maximizes the value function (getting better at classifying real vs. fake), while the generator minimizes it (getting better at fooling the discriminator). This adversarial dynamic drives both networks to improve.
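
For reference, the value function in the minimax objective is the one from the original GAN formulation:

$$
V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(1 - D(G(\mathbf{z})))]
$$

The discriminator pushes both terms up (assigning high probability to real data and low probability to fakes), while the generator pushes the second term down by making $D(G(\mathbf{z}))$ large.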

Question 3

When the GAN reaches its global optimum, what does the optimal discriminator output?

  • A) 1 for all inputs
  • B) 0 for all inputs
  • C) 0.5 for all inputs
  • D) Random values between 0 and 1
Answer **C)** At the global optimum, $p_g = p_{\text{data}}$, so the optimal discriminator is $D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})} = \frac{1}{2}$. The discriminator cannot distinguish real from generated data because they follow the same distribution.
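
To see why, fix $G$ and maximize the value function pointwise: for each $\mathbf{x}$, the integrand has the form $a \log y + b \log(1 - y)$ with $a = p_{\text{data}}(\mathbf{x})$, $b = p_g(\mathbf{x})$, and $y = D(\mathbf{x})$. Setting the derivative $\frac{a}{y} - \frac{b}{1 - y}$ to zero gives $y^* = \frac{a}{a + b}$, i.e. $D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$, which reduces to $\frac{1}{2}$ when $p_g = p_{\text{data}}$.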

Question 4

What is mode collapse in GAN training?

  • A) The discriminator's loss diverges to infinity
  • B) The generator produces only a small subset of the data's diversity, ignoring other modes
  • C) The training process converges too quickly
  • D) The generator learns to produce random noise
Answer **B)** Mode collapse occurs when the generator discovers a few outputs that consistently fool the discriminator and produces only those, ignoring the full diversity of the data distribution. For example, on MNIST, the generator might produce only one or two digit types instead of all ten.

Question 5

Why does the non-saturating generator loss $-\log D(G(\mathbf{z}))$ work better than the original $\log(1 - D(G(\mathbf{z})))$ in practice?

  • A) It has a lower computational cost
  • B) It provides stronger gradients early in training when the discriminator easily identifies fakes
  • C) It guarantees convergence to the global optimum
  • D) It eliminates mode collapse
Answer **B)** Early in training, $D(G(\mathbf{z})) \approx 0$, making $\log(1 - D(G(\mathbf{z}))) \approx 0$ with near-zero gradient. The non-saturating version $-\log D(G(\mathbf{z}))$ provides large gradients when $D(G(\mathbf{z})) \approx 0$, giving the generator a strong learning signal to improve.
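
The difference can be checked numerically. The sketch below (using PyTorch) compares the gradient of each loss with respect to the discriminator's pre-sigmoid logit, which is a proxy for the signal that reaches the generator through the chain rule; the logit value is an arbitrary illustration of a confident discriminator.

```python
import torch

# Early in training the discriminator is confident a generated sample is fake,
# so its pre-sigmoid logit is very negative and D(G(z)) = sigmoid(logit) is near 0.
logit = torch.tensor(-5.0, requires_grad=True)
d_fake = torch.sigmoid(logit)                                   # ~0.0067

# Saturating loss: the generator minimizes log(1 - D(G(z))).
grad_sat, = torch.autograd.grad(torch.log(1 - d_fake), logit, retain_graph=True)

# Non-saturating loss: the generator minimizes -log D(G(z)).
grad_ns, = torch.autograd.grad(-torch.log(d_fake), logit)

print(grad_sat.item())   # ~ -0.0067  -> almost no learning signal
print(grad_ns.item())    # ~ -0.9933  -> strong learning signal
```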

Question 6

Which of the following is NOT a DCGAN architectural guideline?

  • A) Replace pooling layers with strided convolutions
  • B) Use batch normalization in both generator and discriminator
  • C) Use ReLU in the generator and LeakyReLU in the discriminator
  • D) Use large fully connected layers in the middle of the network
Answer **D)** DCGAN recommends removing fully connected layers, using fully convolutional architectures instead. The other three guidelines are core DCGAN principles: strided convolutions replace pooling, BatchNorm stabilizes training (with exceptions at generator output and discriminator input), and ReLU/LeakyReLU are the recommended activations.
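
As a concrete illustration of these guidelines, here is a minimal sketch of a DCGAN-style discriminator for 64 × 64 RGB images (channel counts follow the common tutorial setup and are illustrative, not the exact paper configuration):

```python
import torch.nn as nn

# Strided convolutions instead of pooling, LeakyReLU activations, BatchNorm in
# hidden layers (but not on the input layer), and no fully connected layers:
# the final 4x4 convolution produces a single real/fake score.
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),   # 32 x 32
    nn.Conv2d(64, 128, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),                         # 16 x 16
    nn.Conv2d(128, 256, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),                         # 8 x 8
    nn.Conv2d(256, 512, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True),                         # 4 x 4
    nn.Conv2d(512, 1, 4, stride=1, padding=0),                                    # 1 x 1 score
    nn.Sigmoid(),
)
```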

Question 7

What activation function does the DCGAN generator use in its output layer?

  • A) ReLU
  • B) Sigmoid
  • C) Tanh
  • D) Softmax
Answer **C)** The DCGAN generator uses Tanh in the output layer, mapping generated pixel values to $[-1, 1]$. This matches the preprocessing convention of normalizing real images to the same range. All other generator layers use ReLU.
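
A matching generator sketch, ending in Tanh as described (again, layer sizes are illustrative rather than the exact paper configuration):

```python
import torch.nn as nn

# Maps a 100-d noise vector (shaped 100 x 1 x 1) to a 64 x 64 RGB image in [-1, 1].
# BatchNorm + ReLU in hidden layers, Tanh at the output, no fully connected layers.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(512), nn.ReLU(True),                     # 4 x 4
    nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(256), nn.ReLU(True),                     # 8 x 8
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128), nn.ReLU(True),                     # 16 x 16
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(True),                      # 32 x 32
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1, bias=False),
    nn.Tanh(),                                              # 64 x 64, values in [-1, 1]
)
```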

Question 8

What is the fundamental problem with the Jensen-Shannon divergence that WGAN addresses?

  • A) It is too expensive to compute
  • B) It is constant (provides zero gradient) when the supports of the real and generated distributions do not overlap
  • C) It requires labeled data to compute
  • D) It only works for Gaussian distributions
Answer **B)** When $p_{\text{data}}$ and $p_g$ have non-overlapping supports (which is typical for high-dimensional data on low-dimensional manifolds), the JSD equals a constant $\log 2$ regardless of how close the distributions are. This means the generator receives no useful gradient information. The Wasserstein distance, in contrast, provides meaningful gradients even with non-overlapping supports.
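
Concretely, when the supports are disjoint, the mixture $m = \frac{1}{2}(p_{\text{data}} + p_g)$ equals $\frac{1}{2} p_{\text{data}}$ wherever $p_{\text{data}} > 0$ and $\frac{1}{2} p_g$ wherever $p_g > 0$, so $\mathrm{KL}(p_{\text{data}} \| m) = \mathrm{KL}(p_g \| m) = \log 2$ and

$$
\mathrm{JSD}(p_{\text{data}} \| p_g) = \tfrac{1}{2}\,\mathrm{KL}(p_{\text{data}} \| m) + \tfrac{1}{2}\,\mathrm{KL}(p_g \| m) = \log 2,
$$

no matter how close the two supports are geometrically.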

Question 9

In WGAN, what is the "critic" and how does it differ from a standard GAN discriminator?

  • A) The critic is identical to a discriminator but uses a different name
  • B) The critic outputs a real-valued score (no sigmoid) and is constrained to be Lipschitz continuous
  • C) The critic is a separate network that evaluates the discriminator's performance
  • D) The critic classifies images into multiple categories instead of real/fake
Answer **B)** The WGAN critic replaces the discriminator's sigmoid output with a linear output (real-valued score, not a probability). It is constrained to be 1-Lipschitz continuous (via weight clipping, gradient penalty, or spectral normalization) so that the difference in critic scores between real and generated data approximates the Wasserstein distance.

Question 10

What is the gradient penalty in WGAN-GP?

  • A) A penalty on the generator's gradient magnitude
  • B) A penalty on the critic's gradient norm along interpolations between real and fake samples, encouraging it to have gradient norm close to 1
  • C) A penalty that clips all gradients to a maximum value
  • D) A penalty on the learning rate
Answer **B)** WGAN-GP penalizes the critic when its gradient norm deviates from 1 at points interpolated between real and generated samples: $\lambda \mathbb{E}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$. This soft constraint is more effective than weight clipping for enforcing the Lipschitz condition.
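
A minimal sketch of the penalty term in PyTorch (it assumes a `critic` module and batches `real` and `fake` of matching shape; `lambda_gp = 10` follows the paper's default):

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP term: push the critic's gradient norm toward 1 at points
    interpolated between real and generated samples."""
    batch_size = real.size(0)
    # One interpolation coefficient per sample, broadcast over the remaining dims.
    eps = torch.rand(batch_size, *([1] * (real.dim() - 1)), device=real.device)
    interpolated = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)

    scores = critic(interpolated)
    grads, = torch.autograd.grad(scores.sum(), interpolated, create_graph=True)
    grad_norm = grads.reshape(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```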

Question 11

How many critic updates per generator update does WGAN typically use?

  • A) 1
  • B) 5
  • C) 10
  • D) 50
Answer **B)** WGAN typically uses 5 critic updates per generator update. Unlike standard GANs where an overly strong discriminator causes vanishing gradients, in WGAN, a well-trained critic provides better Wasserstein distance estimates and thus better gradients for the generator.
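
The schedule looks like the toy training loop below (a sketch only: it reuses the `gradient_penalty` function sketched under Question 10, and the data, model sizes, and hyperparameters are illustrative; the Adam settings follow the WGAN-GP paper's defaults):

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch_size, n_critic = 64, 2, 128, 5

# Toy 2-D example: small MLP generator and critic; the "real" data is a shifted Gaussian.
generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
critic = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
gen_opt = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.0, 0.9))

for step in range(1000):
    # --- n_critic critic updates per generator update ---
    for _ in range(n_critic):
        real = torch.randn(batch_size, data_dim) * 0.5 + 2.0          # toy "real" data
        fake = generator(torch.randn(batch_size, latent_dim)).detach()
        # Critic maximizes E[D(real)] - E[D(fake)]; minimize the negative plus the penalty.
        loss_c = critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
        critic_opt.zero_grad()
        loss_c.backward()
        critic_opt.step()

    # --- one generator update: push critic scores up on fakes ---
    loss_g = -critic(generator(torch.randn(batch_size, latent_dim))).mean()
    gen_opt.zero_grad()
    loss_g.backward()
    gen_opt.step()
```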

Question 12

In a conditional GAN, what additional input does the generator receive?

  • A) The discriminator's current loss
  • B) A condition signal (e.g., class label) that specifies what to generate
  • C) A copy of a real image from the dataset
  • D) The current epoch number
Answer **B)** The conditional GAN generator receives both a noise vector $\mathbf{z}$ and a condition $\mathbf{y}$ (e.g., a one-hot class label). The condition specifies what type of output to generate, enabling controlled generation such as "generate a digit 7" or "generate a cat."
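
One common way to implement the conditioning is to concatenate a one-hot label with the noise vector, as in the minimal sketch below (other options, such as learned label embeddings or projection discriminators, are also used in practice):

```python
import torch
import torch.nn.functional as F

num_classes, latent_dim = 10, 100

z = torch.randn(32, latent_dim)                    # noise for a batch of 32
labels = torch.randint(0, num_classes, (32,))      # desired classes, e.g. digits 0-9
y = F.one_hot(labels, num_classes).float()         # condition as one-hot vectors

gen_input = torch.cat([z, y], dim=1)               # generator sees noise + condition
# The discriminator is conditioned the same way: it receives (image, y) pairs,
# e.g. with y appended as extra channels or concatenated to a feature vector.
```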

Question 13

What is the key innovation of StyleGAN's mapping network?

  • A) It reduces the number of parameters in the generator
  • B) It transforms the noise vector $\mathbf{z}$ into an intermediate latent space $\mathcal{W}$ that is more disentangled
  • C) It replaces the discriminator
  • D) It computes the FID score during training
Answer **B)** The mapping network $f: \mathcal{Z} \to \mathcal{W}$ transforms the noise vector into an intermediate latent space where different dimensions are more likely to control independent factors of variation (pose, hair color, etc.). This disentanglement is key to StyleGAN's ability to control generation at different semantic levels.

Question 14

In StyleGAN, how does the style vector $\mathbf{w}$ control generation?

  • A) By being concatenated with the noise vector
  • B) Through Adaptive Instance Normalization (AdaIN) at each layer, with different layers controlling different levels of detail
  • C) By directly setting pixel values
  • D) Through a classification head
Answer **B)** The style vector $\mathbf{w}$ is injected into the generator at each layer via Adaptive Instance Normalization. Early layers control coarse attributes (pose, face shape), middle layers control medium features (facial features, hair), and late layers control fine details (color, texture). This hierarchical control enables style mixing and fine-grained generation control.
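
A minimal sketch of an AdaIN layer (assuming a learned affine layer maps $\mathbf{w}$ to per-channel scale and bias; names and shapes are illustrative, not StyleGAN's exact implementation):

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: normalize each feature map per sample,
    then re-scale and re-shift it with statistics predicted from the style vector w."""
    def __init__(self, w_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels)        # per-sample, per-channel normalization
        self.affine = nn.Linear(w_dim, 2 * num_channels)   # learned map from w to (scale, bias)

    def forward(self, x, w):
        scale, bias = self.affine(w).chunk(2, dim=1)       # each: (batch, channels)
        scale = scale[:, :, None, None]                    # broadcast over spatial dims
        bias = bias[:, :, None, None]
        return (1 + scale) * self.norm(x) + bias           # "+1" so the scale starts near 1

# Usage: x has shape (batch, channels, H, W), w has shape (batch, w_dim), e.g.
# adain = AdaIN(w_dim=512, num_channels=256); out = adain(x, w)
```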

Question 15

What does the Inception Score (IS) measure?

  • A) The Wasserstein distance between real and generated distributions
  • B) The quality (confident classification) and diversity (uniform class distribution) of generated samples
  • C) The discriminator's accuracy
  • D) The training time of the GAN
Answer **B)** IS measures two properties: (1) quality, i.e. each generated image should be confidently classified by an Inception network (low entropy of $p(y|x)$), and (2) diversity, i.e. the marginal distribution over classes should be uniform (high entropy of $p(y)$). Higher IS indicates better quality and diversity.
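
In formula form, with $p(y|\mathbf{x})$ the Inception network's class posterior for a generated image $\mathbf{x}$ and $p(y) = \mathbb{E}_{\mathbf{x} \sim p_g}[p(y|\mathbf{x})]$ the marginal class distribution:

$$
\text{IS} = \exp\big(\mathbb{E}_{\mathbf{x} \sim p_g}\,[\,\mathrm{KL}(p(y|\mathbf{x}) \,\|\, p(y))\,]\big)
$$

The KL term is large exactly when individual predictions are confident (low entropy) and the marginal is spread out (high entropy).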

Question 16

What is the main limitation of the Inception Score?

  • A) It is too expensive to compute
  • B) It does not compare generated samples to real data, so a model that perfectly generates only one class can score highly
  • C) It requires labeled data
  • D) It only works for face images
Answer **B)** IS evaluates generated samples in isolation, without ever referencing the real data distribution. A generator that produces perfect, varied images of dogs only can still achieve a high IS even if the dataset contains both cats and dogs, because nothing in the score checks which modes the real data actually contains. This is why FID, which explicitly compares to real data, is generally preferred.

Question 17

What does the Fréchet Inception Distance (FID) measure?

  • A) The pixel-level difference between real and generated images
  • B) The Fréchet (Wasserstein-2) distance between the Inception feature distributions of real and generated images
  • C) The time to generate one image
  • D) The discriminator's confidence on generated images
Answer **B)** FID models the distributions of real and generated images in Inception-v3 feature space as multivariate Gaussians, then computes the Fréchet distance between them: $\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$. Lower FID indicates higher quality and diversity relative to the real data.
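
A sketch of the final distance computation, assuming the means and covariances (`mu_r`, `sigma_r`, `mu_g`, `sigma_g`) have already been estimated from Inception-v3 activations; in practice an established implementation such as `pytorch-fid` is typically used for the full pipeline:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between two Gaussians fitted to Inception features."""
    diff = mu_r - mu_g
    # Matrix square root of the product of covariances; small imaginary parts
    # caused by numerical error are discarded.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean)
```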

Question 18

What is the standard number of generated samples recommended for computing FID?

  • A) 100
  • B) 1,000
  • C) 10,000 to 50,000
  • D) 1,000,000
Answer **C)** Standard practice uses at least 10,000 generated samples, with 50,000 being the most common choice (matching the ImageNet validation set size). Using too few samples leads to unreliable FID estimates due to noisy covariance matrix estimation.

Question 19

Why does training a GAN often require careful hyperparameter tuning compared to training a classifier?

  • A) GANs have more parameters than classifiers
  • B) GAN training is a two-player game where the balance between generator and discriminator is critical; standard optimization theory for single-loss minimization does not directly apply
  • C) GANs always require more data than classifiers
  • D) GANs use different programming languages
Answer **B)** GAN training involves two competing networks optimizing different objectives simultaneously. This minimax game does not have the convergence guarantees of standard single-objective optimization. The relative strength of the generator and discriminator, their learning rates, update frequencies, and architectures must be carefully balanced for stable training.

Question 20

What stabilization technique uses soft labels (e.g., 0.9 instead of 1.0) for real data in GAN training?

  • A) Feature matching
  • B) Label smoothing
  • C) Spectral normalization
  • D) Gradient penalty
Answer **B)** Label smoothing replaces hard real labels (1.0) with soft labels (e.g., 0.9), preventing the discriminator from becoming overconfident. This reduces the discriminator's gradient magnitude for real samples, indirectly providing more useful gradients to the generator.
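
In code, this amounts to changing the real-label target in the discriminator's loss, as in the minimal sketch below (0.9 is the commonly used value; fake labels are typically left at 0, i.e. one-sided smoothing):

```python
import torch
import torch.nn.functional as F

d_real = torch.rand(32, 1)                      # stand-in for D's probabilities on real images
real_targets = torch.full_like(d_real, 0.9)     # smoothed targets instead of torch.ones_like(d_real)
loss_real = F.binary_cross_entropy(d_real, real_targets)
```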

Question 21

In WGAN, why is the critic's loss a meaningful indicator of sample quality, unlike the discriminator loss in standard GANs?

  • A) The critic's loss uses a different scale
  • B) The critic's loss estimates the Wasserstein distance, which continuously measures how far the generated distribution is from the real distribution
  • C) The critic is trained for more epochs
  • D) The critic uses a different architecture
Answer **B)** The WGAN critic's loss approximates the Wasserstein distance between real and generated distributions. Because the Wasserstein distance is a continuous, meaningful metric of distributional distance, a decreasing critic loss directly indicates that the generated distribution is moving closer to the real one. In standard GANs, the discriminator loss can be low even when sample quality is poor.

Question 22

What is Pix2Pix?

  • A) A GAN for generating random images from noise
  • B) A conditional GAN for paired image-to-image translation (e.g., edges to photos)
  • C) A method for image classification
  • D) A data augmentation technique
Answer **B)** Pix2Pix is a conditional GAN framework for paired image-to-image translation. Given paired examples of input-output images (e.g., edge maps paired with photos), it learns to translate from one domain to another. It uses a U-Net generator and a PatchGAN discriminator, combining adversarial loss with L1 reconstruction loss.
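
The combined objective has the form below, with $\mathbf{x}$ the input image, $\mathbf{y}$ the paired target image, and $\lambda = 100$ in the original paper (the noise input, which Pix2Pix provides via dropout, is omitted here for brevity):

$$
G^* = \arg\min_G \max_D \; \mathcal{L}_{\text{cGAN}}(G, D) + \lambda \, \mathbb{E}_{\mathbf{x}, \mathbf{y}}\big[\|\mathbf{y} - G(\mathbf{x})\|_1\big]
$$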

Question 23

What is spectral normalization?

  • A) Normalizing the input images to have zero mean and unit variance
  • B) Dividing each weight matrix in the discriminator/critic by its largest singular value to enforce a Lipschitz constraint
  • C) Applying batch normalization to the discriminator's output
  • D) Normalizing the noise vector before feeding it to the generator
Answer **B)** Spectral normalization divides each weight matrix by its spectral norm (largest singular value), ensuring that each linear layer has a Lipschitz constant of 1. Composed across layers, this bounds the network's overall Lipschitz constant, stabilizing training without the drawbacks of weight clipping.
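
In PyTorch this is available as a built-in wrapper; a minimal usage sketch (the architecture itself is illustrative):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrap each weight layer of the discriminator/critic: the wrapper estimates the
# largest singular value by power iteration and divides the weight by it on every
# forward pass, bounding each layer's Lipschitz constant.
discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 16 * 16, 1)),   # assumes 64 x 64 inputs
)
```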

Question 24

How do GANs compare to VAEs in terms of sample sharpness?

  • A) VAEs produce sharper samples than GANs
  • B) GANs produce sharper samples because the adversarial loss does not suffer from the averaging effect of reconstruction-based losses
  • C) Both produce equally sharp samples
  • D) Neither can produce sharp samples
Answer **B)** GANs produce sharper samples because the generator is directly optimized to fool the discriminator, which can detect blurriness. VAEs optimize a reconstruction loss (MSE or BCE) that tends to produce blurry outputs by averaging over possible reconstructions. The adversarial loss implicitly penalizes any detectable artifacts, including blur.

Question 25

A research team observes that their GAN generates realistic images but all images look nearly identical. What is the most likely problem and solution?

  • A) The learning rate is too high; decrease it
  • B) Mode collapse; switch to WGAN-GP or add minibatch discrimination
  • C) The discriminator is too weak; increase its capacity
  • D) The images need more preprocessing
Answer **B)** Generating realistic but identical (or nearly identical) images is the hallmark of mode collapse. The generator has found a single output that fools the discriminator and produces only that. Switching to WGAN-GP provides better coverage of all modes because the Wasserstein distance captures distributional distance more faithfully. Minibatch discrimination also helps by allowing the discriminator to detect lack of diversity within a batch.