Chapter 17: Exercises

Part A: Conceptual Foundations

Exercise 17.1: Minimax Objective Analysis

Consider the GAN value function: $$V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} [\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z} [\log(1 - D(G(\mathbf{z})))]$$

a) For a fixed generator $G$, what is the optimal discriminator $D^*(\mathbf{x})$? Derive it by differentiating the integrand pointwise with respect to $D(\mathbf{x})$ and setting the derivative to zero. b) Show that when $p_g = p_{\text{data}}$, the optimal discriminator outputs $D^*(\mathbf{x}) = 0.5$ for all $\mathbf{x}$. c) Substitute $D^*$ back into $V(D^*, G)$ and show that the result involves the Jensen-Shannon divergence. d) What is the value of $V(D^*, G)$ at the global optimum? Verify it numerically.

Exercise 17.2: Non-Saturating Loss

The non-saturating generator loss replaces $\min_G \mathbb{E}[\log(1 - D(G(\mathbf{z})))]$ with $\max_G \mathbb{E}[\log D(G(\mathbf{z}))]$.

a) Compute the gradient of the original loss, $\frac{\partial}{\partial G(\mathbf{z})}\log(1 - D(G(\mathbf{z})))$, when $D(G(\mathbf{z})) \approx 0$ (early training). b) Compute the gradient of the non-saturating loss, $\frac{\partial}{\partial G(\mathbf{z})}\log D(G(\mathbf{z}))$, when $D(G(\mathbf{z})) \approx 0$. c) Compare the magnitudes. Why does the non-saturating version provide a stronger learning signal early in training? d) Do both losses have the same fixed point? Prove or disprove.

Exercise 17.3: Mode Collapse Scenarios

A GAN is trained on a dataset containing equal numbers of cats and dogs.

a) Describe what mode collapse would look like in this scenario. b) Explain mechanistically how mode collapse occurs: what does the generator learn, and how does the discriminator respond? c) Why doesn't the standard GAN objective explicitly penalize lack of diversity? d) Propose two modifications to the training procedure that would mitigate mode collapse.

Exercise 17.4: Discriminator Capacity

Explain the "Goldilocks" principle for discriminator capacity:

a) What happens when the discriminator is too powerful relative to the generator? b) What happens when the discriminator is too weak? c) How do practical techniques like label smoothing and spectral normalization address the overly-powerful discriminator problem? d) In WGAN, why is it beneficial for the critic to be powerful?

Exercise 17.5: DCGAN Architecture

A DCGAN generator takes a 100-dimensional noise vector and produces a $64 \times 64 \times 3$ image.

a) Design the generator architecture layer by layer, specifying the output shape after each transposed convolution. b) Calculate the total number of parameters in the generator. c) Design the mirror-image discriminator architecture. d) Why does DCGAN use LeakyReLU (slope 0.2) in the discriminator instead of ReLU?

Exercise 17.6: Wasserstein Distance Intuition

Consider two 1D distributions: $p = \text{Uniform}[0, 1]$ and $q = \text{Uniform}[\theta, \theta + 1]$.

a) Compute the Wasserstein-1 distance $W(p, q)$ as a function of $\theta$. b) Compute the Jensen-Shannon divergence $D_{\text{JS}}(p, q)$ for $\theta = 0$, $\theta = 0.5$, and $\theta = 2$. c) Plot both distances as functions of $\theta$ (sketch). Which provides a useful gradient for all $\theta$? d) At $\theta = 2$ (non-overlapping supports), what gradient does each distance provide to a generator trying to minimize distance by adjusting $\theta$?


Part B: Mathematical Analysis

Exercise 17.7: Gradient Penalty Derivation

In WGAN-GP, the gradient penalty is: $$\mathcal{L}_{\text{GP}} = \lambda \mathbb{E}_{\hat{\mathbf{x}}} [(\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_2 - 1)^2]$$

a) Why is the target gradient norm 1 (not 0 or some other value)? How does this relate to the 1-Lipschitz constraint? b) Why are the interpolated points $\hat{\mathbf{x}} = \alpha \mathbf{x}_{\text{real}} + (1 - \alpha) \mathbf{x}_{\text{fake}}$ used instead of arbitrary points? c) Compute the gradient of the gradient penalty with respect to the critic parameters. Why does this involve second-order derivatives? d) Why is batch normalization incompatible with the gradient penalty?

Exercise 17.8: Spectral Normalization

Spectral normalization divides each weight matrix $\mathbf{W}$ by its spectral norm $\sigma(\mathbf{W})$ (largest singular value).

a) Show that a neural network layer $f(\mathbf{x}) = \phi(\mathbf{W}\mathbf{x})$ with a 1-Lipschitz activation $\phi$ is 1-Lipschitz if $\sigma(\mathbf{W}) = 1$ (here $\phi$ denotes the activation and $\sigma(\cdot)$ the spectral norm). b) How does the Lipschitz constant compose across multiple layers? If each layer is 1-Lipschitz, what is the overall Lipschitz constant of the composition? c) Why is spectral normalization preferred over weight clipping in practice? d) Can spectral normalization be used in the generator? What effect would it have?

Exercise 17.9: Inception Score Analysis

The Inception Score is $\text{IS} = \exp(\mathbb{E}_{\mathbf{x}} [D_{\text{KL}}(p(y|\mathbf{x}) \| p(y))])$.

a) If every generated image is classified with 100% confidence as the same class, what is the IS? (Assume 10 classes.) b) If every generated image is classified with 100% confidence, and each class is equally represented, what is the IS? c) If every generated image produces a uniform class distribution $p(y|\mathbf{x}) = 1/10$, what is the IS? d) Explain why IS fails to detect a model that generates perfect images of only one class.

Exercise 17.10: FID Computation

For univariate Gaussians $\mathcal{N}(\mu_r, \sigma_r^2)$ and $\mathcal{N}(\mu_g, \sigma_g^2)$:

a) Simplify the FID formula to the 1D case. b) Compute FID for $\mu_r = 0, \sigma_r = 1, \mu_g = 0.5, \sigma_g = 1.2$. c) Compute FID for $\mu_r = 0, \sigma_r = 1, \mu_g = 0, \sigma_g = 1$ (identical distributions). Verify it equals 0. d) How does FID change if you double the number of evaluation samples? (Hint: consider the estimation error of $\mu$ and $\Sigma$.)

Exercise 17.11: Conditional GAN Analysis

In a conditional GAN for MNIST:

a) How does conditioning change the generator's input? Specify the input dimension if $\mathbf{z} \in \mathbb{R}^{100}$ and the label is one-hot with 10 classes. b) How should the discriminator incorporate the condition? Describe two approaches. c) What happens if the generator ignores the condition and produces random digits? How would the discriminator respond? d) Compare conditional GAN to conditional VAE (Chapter 16). What are the trade-offs?

Exercise 17.12: GAN vs. VAE Trade-offs

A data scientist needs to build a generative model for handwritten digits. Compare GANs and VAEs along the following dimensions:

a) Sample quality (sharpness). b) Training stability and ease of implementation. c) Latent space structure and interpolation quality. d) Mode coverage and diversity. e) Likelihood estimation capability. f) For which downstream applications would you recommend each?


Part C: Coding Exercises

Exercise 17.13: Simple GAN for 2D Data

Implement a GAN to learn a 2D Gaussian mixture (e.g., 8 Gaussians arranged in a circle).

a) Implement generator and discriminator as small MLPs. b) Train with the standard GAN loss and visualize generated samples every 500 iterations. c) Observe and document any mode collapse (which modes are missing?). d) Experiment with the number of discriminator updates per generator update ($k \in \{1, 3, 5\}$).
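A minimal starting sketch, assuming PyTorch; the architecture, learning rates, and the ring-of-Gaussians sampler below are illustrative rather than prescribed.

```python
import math
import torch
import torch.nn as nn

def sample_ring(batch_size, n_modes=8, radius=2.0, std=0.02):
    """Sample from a mixture of n_modes Gaussians arranged on a circle."""
    angles = 2 * math.pi * torch.randint(0, n_modes, (batch_size,)) / n_modes
    centers = torch.stack([radius * torch.cos(angles), radius * torch.sin(angles)], dim=1)
    return centers + std * torch.randn(batch_size, 2)

def mlp(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

G = mlp(2, 2)   # maps 2D noise to 2D samples
D = mlp(2, 1)   # outputs a logit; pairs with BCEWithLogitsLoss
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(5000):
    # --- discriminator update ---
    x_real = sample_ring(256)
    x_fake = G(torch.randn(256, 2)).detach()
    loss_d = bce(D(x_real), torch.ones(256, 1)) + bce(D(x_fake), torch.zeros(256, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # --- generator update (non-saturating loss) ---
    x_fake = G(torch.randn(256, 2))
    loss_g = bce(D(x_fake), torch.ones(256, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    # Every 500 steps: scatter-plot G(torch.randn(1000, 2)) to check mode coverage (part b).
```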

Exercise 17.14: DCGAN Implementation

Implement a DCGAN for MNIST following the guidelines from Section 17.3.

a) Implement the generator and discriminator architectures with proper activation functions, BatchNorm, and weight initialization. b) Train for 25 epochs with Adam ($\beta_1 = 0.5$). c) Generate a grid of 64 images at epochs 1, 5, 10, and 25. Document the quality progression. d) Plot the generator and discriminator losses over training.
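One possible generator layout to start from, assuming PyTorch and MNIST images resized to $64 \times 64$ with a single channel; the exact channel counts are illustrative.

```python
import torch.nn as nn

def dcgan_generator(z_dim=100, ngf=64, channels=1):
    return nn.Sequential(
        # z: (N, z_dim, 1, 1) -> (N, ngf*8, 4, 4)
        nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False),
        nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
        # -> (N, ngf*4, 8, 8)
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
        # -> (N, ngf*2, 16, 16)
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        # -> (N, ngf, 32, 32)
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf), nn.ReLU(True),
        # -> (N, channels, 64, 64), tanh output in [-1, 1]
        nn.ConvTranspose2d(ngf, channels, 4, 2, 1, bias=False),
        nn.Tanh(),
    )

def weights_init(m):
    # DCGAN-style initialization: N(0, 0.02) for conv weights
    if isinstance(m, (nn.ConvTranspose2d, nn.Conv2d)):
        nn.init.normal_(m.weight, 0.0, 0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, 1.0, 0.02)
        nn.init.zeros_(m.bias)

G = dcgan_generator()
G.apply(weights_init)
```

The discriminator mirrors this stack with strided Conv2d layers, LeakyReLU(0.2), and no BatchNorm on the first layer.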

Exercise 17.15: WGAN-GP Implementation

Implement WGAN-GP and compare with the standard GAN from Exercise 17.14.

a) Modify the discriminator to output a linear score (no sigmoid). b) Implement the gradient penalty computation. c) Train with 5 critic updates per generator update. d) Plot the Wasserstein estimate over training. Does it correlate with sample quality?
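A sketch of the gradient-penalty term for part (b), assuming PyTorch and a `critic` that maps a batch of images to one scalar score per sample.

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lambda_gp=10.0):
    batch = x_real.size(0)
    # Random interpolation coefficients, broadcast over all non-batch dimensions
    alpha = torch.rand(batch, *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]   # keep graph for 2nd-order backprop
    grad_norm = grads.view(batch, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```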

Exercise 17.16: Conditional GAN

Implement a conditional GAN for MNIST class-conditional generation.

a) Modify the generator to accept a concatenated $[\mathbf{z}, \mathbf{y}]$ input. b) Modify the discriminator to accept the condition (via concatenation or projection). c) Generate a $10 \times 10$ grid: each row is a different digit class, each column uses a different noise vector. d) Fix $\mathbf{z}$ and vary the condition. Do the generated digits share a consistent "style" across classes?
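A minimal sketch of conditioning by concatenation, assuming PyTorch; the MLP generator here is illustrative, and a convolutional generator works the same way once the label is broadcast to a spatial map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondGenerator(nn.Module):
    def __init__(self, z_dim=100, n_classes=10, out_dim=28 * 28):
        super().__init__()
        self.n_classes = n_classes
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh())

    def forward(self, z, y):
        y_onehot = F.one_hot(y, self.n_classes).float()
        return self.net(torch.cat([z, y_onehot], dim=1))  # input dim = 110, cf. Exercise 17.11a

# Usage: one fixed z per column and one class per row yields the 10x10 grid of part (c).
G = CondGenerator()
z = torch.randn(10, 100)
imgs = G(z, torch.full((10,), 3))  # ten "3"s drawn with different noise vectors
```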

Exercise 17.17: GAN Training Monitoring Dashboard

Create a comprehensive training monitoring system for GANs.

a) Log generator loss, discriminator loss (real and fake), and gradient norms. b) Implement FID computation on a subset of generated vs. real images (use a simple feature extractor if Inception is not available). c) Track the diversity of generated samples (e.g., number of unique predicted classes using a pretrained classifier). d) Implement early stopping based on FID.
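A lightweight logging sketch for parts (a) and (c), assuming PyTorch; the history structure and helper names are illustrative.

```python
import torch
from collections import defaultdict

history = defaultdict(list)

def grad_norm(model):
    """Global L2 norm of all parameter gradients currently stored in the model."""
    norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    return torch.stack(norms).norm().item() if norms else 0.0

def log_step(step, loss_g, loss_d_real, loss_d_fake, G, D):
    # Call after backward() but before zero_grad() so gradients are still present.
    history["step"].append(step)
    history["loss_g"].append(loss_g)
    history["loss_d_real"].append(loss_d_real)
    history["loss_d_fake"].append(loss_d_fake)
    history["grad_norm_g"].append(grad_norm(G))
    history["grad_norm_d"].append(grad_norm(D))

@torch.no_grad()
def sample_diversity(classifier, samples):
    """Number of distinct classes predicted by a pretrained classifier (part c)."""
    return classifier(samples).argmax(dim=1).unique().numel()
```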

Exercise 17.18: Label Smoothing and Noise

Implement two stabilization techniques for the standard GAN:

a) One-sided label smoothing: use 0.9 instead of 1.0 for real labels, keep 0.0 for fake. b) Instance noise: add Gaussian noise to both real and fake images fed to the discriminator, with noise level decaying over training. c) Compare training stability (measured by loss variance over the last 100 iterations) with and without these techniques. d) Compare final FID scores.
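A sketch of both techniques combined in one discriminator loss, assuming PyTorch; the linear noise-decay schedule is illustrative.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def d_loss_smoothed(D, x_real, x_fake, step, total_steps, sigma0=0.1):
    # Instance noise: the same decaying Gaussian noise level for real and fake inputs
    sigma = sigma0 * max(0.0, 1.0 - step / total_steps)
    logits_real = D(x_real + sigma * torch.randn_like(x_real))
    logits_fake = D(x_fake + sigma * torch.randn_like(x_fake))
    loss_real = bce(logits_real, torch.full_like(logits_real, 0.9))  # one-sided smoothing
    loss_fake = bce(logits_fake, torch.zeros_like(logits_fake))      # fake labels stay at 0.0
    return loss_real + loss_fake
```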


Part D: Applied Exercises

Exercise 17.19: Data Augmentation with GANs

Use a trained GAN to augment a small training set for classification.

a) Train a classifier on the original MNIST training set (full). Record test accuracy. b) Create a reduced training set (500 samples). Train the same classifier. Record accuracy. c) Train a GAN on the reduced training set. Generate 5,000 synthetic images. d) Train the classifier on reduced + synthetic data. Does accuracy improve? e) Discuss the limitations of this approach.
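A sketch of assembling the augmented training set for part (d), assuming PyTorch and a class-conditional generator `G(z, y)` (or pseudo-labels from a classifier) so the synthetic images carry labels; names are illustrative.

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

@torch.no_grad()
def make_synthetic_dataset(G, n_per_class=500, z_dim=100, n_classes=10):
    images, labels = [], []
    for c in range(n_classes):
        z = torch.randn(n_per_class, z_dim)
        y = torch.full((n_per_class,), c, dtype=torch.long)
        images.append(G(z, y).cpu())
        labels.append(y)
    return TensorDataset(torch.cat(images), torch.cat(labels))

# reduced_dataset: the 500-sample real subset from part (b)
# combined = ConcatDataset([reduced_dataset, make_synthetic_dataset(G)])
# loader = DataLoader(combined, batch_size=64, shuffle=True)
```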

Exercise 17.20: Interpolation and Latent Space Exploration

Using a trained DCGAN:

a) Perform linear interpolation between pairs of noise vectors. Visualize the decoded images. b) Perform spherical interpolation (slerp) between the same pairs. Compare with linear interpolation. c) Attempt "vector arithmetic" in latent space (e.g., $z_\text{smile} - z_\text{neutral} + z_\text{other}$). Does it work on MNIST (e.g., "thick 1" - "thin 1" + "thin 7" = "thick 7")? d) Why is the GAN's latent space generally less structured than the VAE's?
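A sketch of both interpolation schemes, assuming PyTorch; the reshape in the final comment assumes a DCGAN that expects latent vectors of shape (N, 100, 1, 1).

```python
import torch

def lerp(z0, z1, t):
    return (1 - t) * z0 + t * z1

def slerp(z0, z1, t):
    """Spherical interpolation between two 1D latent vectors."""
    z0_n, z1_n = z0 / z0.norm(), z1 / z1.norm()
    omega = torch.acos(torch.clamp(torch.dot(z0_n, z1_n), -1.0, 1.0))
    if omega.abs() < 1e-6:                # nearly parallel: fall back to lerp
        return lerp(z0, z1, t)
    return (torch.sin((1 - t) * omega) * z0 + torch.sin(t * omega) * z1) / torch.sin(omega)

z0, z1 = torch.randn(100), torch.randn(100)
path = torch.stack([slerp(z0, z1, t) for t in torch.linspace(0, 1, 8)])
# images = G(path.unsqueeze(-1).unsqueeze(-1))   # decode the 8 intermediate points
```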

Exercise 17.21: Progressive Training

Implement a simplified version of progressive growing:

a) Start with a $4 \times 4$ generator and discriminator. b) After 5 epochs, add layers for $8 \times 8$. c) After 5 more epochs, add layers for $16 \times 16$. d) Compare final sample quality with a non-progressive approach trained for the same total number of epochs.
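A heavily reduced sketch of the growing idea, assuming PyTorch; no fade-in blending is shown, layers are simply appended at each stage, and the optimizer must be recreated after each call to `grow()` so it sees the new parameters.

```python
import torch
import torch.nn as nn

class GrowingGenerator(nn.Module):
    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        self.ch = ch
        # Stage 0: z -> 4x4 feature map
        self.blocks = nn.ModuleList([nn.Sequential(
            nn.ConvTranspose2d(z_dim, ch, 4, 1, 0), nn.BatchNorm2d(ch), nn.ReLU(True))])
        self.to_img = nn.Conv2d(ch, 1, 3, 1, 1)   # output head at the current resolution

    def grow(self):
        """Double the output resolution by appending an upsampling block."""
        self.blocks.append(nn.Sequential(
            nn.ConvTranspose2d(self.ch, self.ch, 4, 2, 1),
            nn.BatchNorm2d(self.ch), nn.ReLU(True)))
        self.to_img = nn.Conv2d(self.ch, 1, 3, 1, 1)  # fresh head for the new resolution

    def forward(self, z):
        h = z.view(z.size(0), -1, 1, 1)
        for block in self.blocks:
            h = block(h)
        return torch.tanh(self.to_img(h))
```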

Exercise 17.22: GAN Evaluation Suite

Build a comprehensive evaluation pipeline:

a) Implement Inception Score using a pretrained classifier (you may use a simple CNN trained on MNIST as a substitute for InceptionV3). b) Implement FID using features from the same classifier. c) Implement precision and recall for generative models (Sajjadi et al., 2018). d) Evaluate three models: DCGAN, WGAN-GP, and a conditional GAN. Which scores best on each metric?
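A sketch of the FID computation from precomputed features, assuming NumPy/SciPy; `feats_real` and `feats_fake` are (N, d) activations from any fixed feature extractor, e.g. the penultimate layer of the MNIST classifier from part (a).

```python
import numpy as np
from scipy import linalg

def fid_from_features(feats_real, feats_fake):
    mu_r, mu_g = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real                  # discard tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2 * covmean)
```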


Part E: Research and Extension

Exercise 17.23: LSGAN Implementation

The Least Squares GAN replaces the BCE loss with squared error:

$$\mathcal{L}_D = \frac{1}{2}\mathbb{E}[(D(\mathbf{x}) - 1)^2] + \frac{1}{2}\mathbb{E}[D(G(\mathbf{z}))^2]$$

$$\mathcal{L}_G = \frac{1}{2}\mathbb{E}[(D(G(\mathbf{z})) - 1)^2]$$

a) Implement LSGAN and train on MNIST. b) Compare training stability with standard GAN and WGAN-GP. c) Analyze: why does the squared error provide non-vanishing gradients? d) Compare FID scores across all three variants.
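A sketch of the two losses above, assuming PyTorch and a discriminator that outputs a raw, unbounded score (no sigmoid).

```python
def lsgan_d_loss(D, x_real, x_fake):
    # Real samples pushed toward 1, fake samples toward 0
    return 0.5 * ((D(x_real) - 1) ** 2).mean() + 0.5 * (D(x_fake.detach()) ** 2).mean()

def lsgan_g_loss(D, x_fake):
    # Generator pushes fake samples toward the "real" target 1
    return 0.5 * ((D(x_fake) - 1) ** 2).mean()
```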

Exercise 17.24: CycleGAN (Simplified)

Implement a simplified CycleGAN for unpaired domain transfer between MNIST and inverted MNIST (white digits on black vs. black digits on white).

a) Implement two generators ($G_{A \to B}$, $G_{B \to A}$) and two discriminators ($D_A$, $D_B$). b) Implement the cycle consistency loss: $\|G_{B \to A}(G_{A \to B}(\mathbf{x}_A)) - \mathbf{x}_A\|_1$. c) Train and evaluate. Can the model learn to invert images without paired examples? d) What happens if you remove the cycle consistency loss?
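A sketch of the cycle-consistency term for part (b), assuming PyTorch, with `G_ab` and `G_ba` denoting $G_{A \to B}$ and $G_{B \to A}$; the weight $\lambda$ is illustrative.

```python
def cycle_loss(G_ab, G_ba, x_a, x_b, lam=10.0):
    forward_cycle = (G_ba(G_ab(x_a)) - x_a).abs().mean()   # A -> B -> A, L1 reconstruction
    backward_cycle = (G_ab(G_ba(x_b)) - x_b).abs().mean()  # B -> A -> B, L1 reconstruction
    return lam * (forward_cycle + backward_cycle)
```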

Exercise 17.25: Spectral Normalization Implementation

Implement spectral normalization from scratch (without using torch.nn.utils.spectral_norm).

a) Implement the power iteration method to estimate the largest singular value. b) Apply spectral normalization to the discriminator of a DCGAN. c) Compare training stability with and without spectral normalization. d) Verify that the estimated spectral norm is close to the true value (computed via SVD).
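A sketch of power iteration on a single 2D weight matrix, assuming PyTorch; part (d) can compare the estimate against torch.linalg.svdvals.

```python
import torch

def spectral_norm_power_iteration(W, n_iters=20):
    """Estimate the largest singular value of a 2D tensor W."""
    u = torch.randn(W.size(0))
    for _ in range(n_iters):
        v = W.t() @ u
        v = v / (v.norm() + 1e-12)
        u = W @ v
        u = u / (u.norm() + 1e-12)
    return torch.dot(u, W @ v)   # sigma ~= u^T W v

W = torch.randn(256, 128)
print(spectral_norm_power_iteration(W))   # power-iteration estimate
print(torch.linalg.svdvals(W)[0])         # reference value via SVD
```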

Exercise 17.26: GAN Theory Deep Dive

Write a detailed analysis (1--2 pages) addressing the following:

a) Why does the original GAN paper prove convergence but practical GANs rarely converge? What assumptions in the proof fail in practice? b) Compare the JS divergence, Wasserstein distance, and Maximum Mean Discrepancy (MMD) as training objectives for GANs. What are the theoretical and practical trade-offs? c) The "GAN zoo" contains hundreds of variants. Categorize the most impactful contributions and explain why the field has consolidated around a few architectures (StyleGAN, BigGAN). d) How have diffusion models changed the landscape of generative modeling? In what scenarios do GANs still have advantages?

Exercise 17.27: Disentangled Generation with InfoGAN

InfoGAN maximizes the mutual information between a subset of latent variables $\mathbf{c}$ and the generated output $G(\mathbf{z}, \mathbf{c})$.

a) Implement InfoGAN with both discrete (10 categories) and continuous (2 dimensions) latent codes for MNIST. b) After training, vary the discrete code. Does each value correspond to a different digit? c) Vary each continuous code independently. What visual attribute does each control? d) Compare the disentanglement achieved by InfoGAN with that of $\beta$-VAE (Exercise 16.24).
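A sketch of the auxiliary ("Q") loss, assuming PyTorch and a Q-head that shares the discriminator body and outputs class logits for the discrete code plus predicted means for the continuous codes; the weights are illustrative.

```python
import torch.nn.functional as F

def info_loss(q_logits, q_cont_mean, c_discrete, c_cont, lam_disc=1.0, lam_cont=0.1):
    # Variational lower bound on mutual information: reconstruct the codes from G(z, c)
    loss_disc = F.cross_entropy(q_logits, c_discrete)
    loss_cont = F.mse_loss(q_cont_mean, c_cont)   # fixed-variance Gaussian assumption
    return lam_disc * loss_disc + lam_cont * loss_cont
```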

Exercise 17.28: GAN for Tabular Data

Adapt the GAN framework for generating synthetic tabular data.

a) Design generator and discriminator architectures appropriate for mixed data types (continuous, categorical, binary). b) Handle categorical variables using Gumbel-softmax or one-hot encoding. c) Train on a small tabular dataset (e.g., Iris, Wine, or synthetic data). d) Evaluate synthetic data quality using: (i) marginal distribution comparison, (ii) correlation matrix comparison, (iii) "train on synthetic, test on real" classifier accuracy.
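A sketch of handling one categorical column with Gumbel-softmax, assuming PyTorch; `logits` stand for the generator's output head for that column.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 3)                           # e.g. a 3-category column
soft = F.gumbel_softmax(logits, tau=0.5, hard=False)  # differentiable, approximately one-hot
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)   # straight-through one-hot samples
# The discriminator sees `hard` (or `soft`); gradients still flow to the generator's logits.
```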

Exercise 17.29: Wasserstein Distance Estimation

Implement a standalone Wasserstein distance estimator (not as part of GAN training).

a) Generate samples from two known 2D distributions. b) Train a critic network to estimate the Wasserstein distance. c) Compare the estimated distance with the true distance (computed analytically for simple distributions). d) Investigate how the number of critic training steps affects the quality of the estimate.
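A sketch of a standalone estimator via the Kantorovich-Rubinstein dual, assuming PyTorch; the 1-Lipschitz constraint is approximated with a gradient penalty as in Exercise 17.15, and the critic size and step counts are illustrative.

```python
import torch
import torch.nn as nn

def estimate_w1(sample_p, sample_q, steps=2000, batch=256, lambda_gp=10.0):
    critic = nn.Sequential(nn.Linear(2, 128), nn.ReLU(),
                           nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))
    opt = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.5, 0.9))
    for _ in range(steps):
        x_p, x_q = sample_p(batch), sample_q(batch)
        # Gradient penalty on random interpolates between the two samples
        alpha = torch.rand(batch, 1)
        x_hat = (alpha * x_p + (1 - alpha) * x_q).requires_grad_(True)
        grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
        gp = ((grads.norm(2, dim=1) - 1) ** 2).mean()
        # Maximize E_p[f] - E_q[f] subject to the (soft) Lipschitz constraint
        loss = -(critic(x_p).mean() - critic(x_q).mean()) + lambda_gp * gp
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return (critic(sample_p(10000)).mean() - critic(sample_q(10000)).mean()).item()

# Example: q is p shifted by (3, 0), so the true W1 distance is 3.
# est = estimate_w1(lambda n: torch.randn(n, 2),
#                   lambda n: torch.randn(n, 2) + torch.tensor([3.0, 0.0]))
```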

Exercise 17.30: GAN Security: Deepfake Detection

GANs raise important ethical concerns about deepfakes.

a) Train a GAN to generate face-like images (simplified: use MNIST or Fashion-MNIST). b) Train a binary classifier to detect GAN-generated images vs. real images. c) Evaluate the detector's accuracy as the GAN improves over training epochs. d) Discuss: what properties of GAN-generated images make them detectable? How might future GANs overcome these detectable artifacts?