Chapter 17: Key Takeaways

Core Concepts

  1. GANs learn to generate data through adversarial training between a generator and a discriminator. The generator creates fake data from noise; the discriminator classifies real vs. fake. This minimax game drives the generator toward producing data indistinguishable from real data.

  2. The GAN objective implicitly minimizes the Jensen-Shannon divergence between real and generated distributions. When the optimal discriminator $D^*(\mathbf{x}) = p_{\text{data}}(\mathbf{x})/(p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x}))$ is substituted back into the value function, the generator's objective becomes $2\,D_{\text{JS}}(p_{\text{data}} \| p_g) - \log 4$, so minimizing it minimizes the Jensen-Shannon divergence.

  3. GANs produce sharp outputs because the discriminator can detect blurriness. Unlike VAEs, which average over possible outputs due to reconstruction losses, GANs are implicitly penalized for any detectable artifact---including blur---because the discriminator would catch it.
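The adversarial setup above can be sketched as a single training step. This is a minimal illustration with toy 1-D data and tiny MLPs (all layer sizes and batch shapes here are illustrative assumptions, not the chapter's architecture):

```python
import torch
import torch.nn as nn

# Toy generator (noise -> sample) and discriminator (sample -> logit).
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))

bce = nn.BCEWithLogitsLoss()
real = torch.randn(32, 1) * 0.5 + 2.0        # stand-in "real" data
fake = G(torch.randn(32, 8))                 # generated from noise

# Discriminator step: push real logits toward 1, fake logits toward 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))

# Generator step: fool the discriminator into labeling fakes as real
# (the non-saturating form commonly used in practice).
g_loss = bce(D(fake), torch.ones(32, 1))
```

In a real loop each loss would be followed by its own optimizer step; the `detach()` keeps the discriminator update from flowing gradients into the generator.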

Training Challenges

  1. Mode collapse is the primary GAN failure mode. The generator finds a few outputs that fool the discriminator and produces only those, ignoring the data's full diversity. WGAN-GP and minibatch discrimination are the standard remedies.

  2. Training instability arises from the two-player game dynamics. GAN training is not standard optimization; it seeks a Nash equilibrium rather than a minimum. Vanishing gradients, oscillation, and sensitivity to hyperparameters are all consequences of this game-theoretic nature.

  3. The non-saturating generator loss provides better gradients early in training. Replacing $\log(1 - D(G(z)))$ with $-\log D(G(z))$ avoids the vanishing gradient problem when the discriminator dominates, which is common at the start of training.
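The gradient asymmetry in takeaway 3 can be checked directly by differentiating both generator losses at a point where the discriminator confidently rejects fakes (the value $D(G(z)) = 10^{-4}$ below is an illustrative assumption):

```python
import torch

# Saturating (original minimax) loss: log(1 - D(G(z))).
d_out = torch.tensor(1e-4, requires_grad=True)   # D(G(z)) near zero
torch.log(1 - d_out).backward()
grad_sat = d_out.grad.item()                     # -1/(1 - D) ~ -1: weak signal

# Non-saturating loss: -log D(G(z)).
d_out2 = torch.tensor(1e-4, requires_grad=True)
(-torch.log(d_out2)).backward()
grad_nonsat = d_out2.grad.item()                 # -1/D ~ -10000: strong signal
```

When the discriminator dominates, the non-saturating loss delivers gradients several orders of magnitude larger, which is exactly why it is preferred early in training.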

Architecture (DCGAN)

  1. DCGAN established the architectural blueprint for convolutional GANs. The five key guidelines are: strided convolutions instead of pooling, batch normalization in both networks (with exceptions), no fully connected layers in deeper architectures, ReLU in the generator (Tanh at output), and LeakyReLU in the discriminator.

  2. Weight initialization from $\mathcal{N}(0, 0.02)$---zero mean, standard deviation 0.02---is important for stable DCGAN training. This small standard deviation prevents large activations and gradients in early training iterations.

  3. Adam with $\beta_1 = 0.5$ is the standard optimizer for GANs. The reduced momentum (from the default 0.9) prevents the optimizer from accumulating too much momentum in the oscillating gradient landscape of adversarial training.
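Takeaways 2 and 3 translate into a few lines of setup. This sketch assumes PyTorch and an illustrative one-layer generator fragment; the initialization of BatchNorm from $\mathcal{N}(1, 0.02)$ follows the same DCGAN convention:

```python
import torch
import torch.nn as nn

def weights_init(m):
    # Conv weights from N(0, 0.02): zero mean, standard deviation 0.02.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, mean=1.0, std=0.02)
        nn.init.zeros_(m.bias)

# Illustrative generator fragment: strided transposed conv + BN + ReLU.
net = nn.Sequential(nn.ConvTranspose2d(100, 64, 4, 2, 1),
                    nn.BatchNorm2d(64), nn.ReLU())
net.apply(weights_init)

# Adam with beta1 = 0.5; beta2 keeps its default 0.999.
opt = torch.optim.Adam(net.parameters(), lr=2e-4, betas=(0.5, 0.999))
```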

Wasserstein GAN

  1. The Wasserstein distance solves the fundamental gradient problem of standard GANs. When real and generated distributions have non-overlapping support, the JSD is constant (no gradient), while the Wasserstein distance provides meaningful, continuous gradients.

  2. WGAN-GP enforces the Lipschitz constraint via gradient penalty. Penalizing the critic's gradient norm along interpolations between real and fake samples ($\lambda = 10$ standard) is more effective than weight clipping. BatchNorm should not be used in the critic with WGAN-GP.

  3. In WGAN, the critic loss is a meaningful quality indicator. Unlike standard GAN discriminator loss, the WGAN critic loss estimates the Wasserstein distance and correlates with generation quality. A decreasing Wasserstein estimate indicates improving samples.
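The gradient penalty from takeaway 2 can be sketched as follows; the critic here is a toy linear network on 2-D points (for images, the interpolation weight `eps` would be reshaped to broadcast over the channel and spatial dimensions):

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    # Random per-sample interpolation between real and fake samples.
    eps = torch.rand(real.size(0), 1)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    # Gradient of critic scores w.r.t. the interpolated inputs;
    # create_graph=True lets the penalty itself be backpropagated.
    grads, = torch.autograd.grad(scores.sum(), interp, create_graph=True)
    # Penalize deviation of the gradient norm from 1 (Lipschitz target).
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = torch.nn.Linear(2, 1)        # illustrative scalar-output critic
real = torch.randn(16, 2)
fake = torch.randn(16, 2)
gp = gradient_penalty(critic, real, fake)
```

In training, `gp` is simply added to the critic's Wasserstein loss before the backward pass.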

Conditional and Advanced GANs

  1. Conditional GANs enable controlled generation by providing class labels to both generator and discriminator. The condition specifies what to generate (content), while the noise vector controls how it looks (style). This also implicitly improves mode coverage.

  2. StyleGAN's mapping network produces a disentangled intermediate latent space $\mathcal{W}$. Injecting the style vector $\mathbf{w}$ through AdaIN at different layers controls different levels of detail: coarse (pose, shape), medium (features), and fine (texture, color).

  3. Pix2Pix and CycleGAN extended GANs to image-to-image translation. Pix2Pix requires paired data and adds an L1 reconstruction loss against the target image; CycleGAN works with unpaired data, using its cycle-consistency loss as the reconstruction term. Both combine an adversarial loss with a reconstruction loss.
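The conditioning mechanism in takeaway 1 is often implemented by concatenating the label with the inputs of both networks. A minimal sketch, with illustrative MLPs standing in for real architectures and assumed sizes (10 classes, 64-D noise, 28x28 outputs):

```python
import torch
import torch.nn as nn

n_classes, z_dim = 10, 64
label = torch.tensor([3])                              # class to generate
onehot = nn.functional.one_hot(label, n_classes).float()

# Generator sees noise + condition: the label picks content, z picks style.
z = torch.randn(1, z_dim)
G = nn.Sequential(nn.Linear(z_dim + n_classes, 128), nn.ReLU(),
                  nn.Linear(128, 28 * 28))
fake = G(torch.cat([z, onehot], dim=1))

# Discriminator sees sample + condition, so it can judge "real AND matching".
D = nn.Sequential(nn.Linear(28 * 28 + n_classes, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1))
logit = D(torch.cat([fake, onehot], dim=1))
```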

Evaluation

  1. FID is the standard metric for evaluating generative models. It measures the Fréchet distance between real and generated feature distributions in Inception space. Lower FID means better quality and diversity. Use at least 10,000 (ideally 50,000) samples.

  2. Inception Score measures quality and diversity but does not compare to real data. IS evaluates whether generated samples are confidently and diversely classified, but a model that memorizes one perfect image per class can still score well.

  3. No single metric captures all aspects of generation quality. Use FID for overall quality, IS for a quick check, and precision/recall for disentangling quality from diversity. Always include visual inspection.
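The FID computation is $\|\mu_r - \mu_g\|^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$ over Inception features. The sketch below makes a simplifying assumption of diagonal covariances so the matrix square root reduces to an elementwise one (real implementations use full 2048-D covariances and e.g. `scipy.linalg.sqrtm`); the random "features" are stand-ins:

```python
import numpy as np

def fid_diagonal(feats_real, feats_gen):
    # Mean term: squared distance between feature means.
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    mean_term = np.sum((mu_r - mu_g) ** 2)
    # Covariance term, under the diagonal-covariance simplification.
    var_r, var_g = feats_real.var(0), feats_gen.var(0)
    cov_term = np.sum(var_r + var_g - 2 * np.sqrt(var_r * var_g))
    return mean_term + cov_term

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(50_000, 8))   # stand-in "Inception features"
close = rng.normal(0.1, 1.0, size=(50_000, 8))  # slightly shifted generator
far = rng.normal(2.0, 1.0, size=(50_000, 8))    # badly shifted generator

assert fid_diagonal(real, close) < fid_diagonal(real, far)  # lower = better
```

Note the 50,000-sample counts above: the chapter's advice on sample size matters because both the means and covariances are estimated from data.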

Practical Guidance

  1. Start with DCGAN for quick prototyping; switch to WGAN-GP for stable training. DCGAN is simpler and faster; WGAN-GP provides more reliable convergence and meaningful training curves.

  2. Monitor D(real), D(fake), and visual samples---not just the loss values. A discriminator loss near zero means the discriminator is too strong. D(real) around 0.7--0.9 and D(fake) gradually increasing toward 0.3--0.5 indicates healthy training.

  3. GANs generate faster but diffusion models now achieve higher quality. For applications requiring real-time generation (e.g., interactive systems), GANs remain advantageous. For maximum quality without speed constraints, consider diffusion models (Chapter 18).
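The monitoring advice in takeaway 2 amounts to logging mean discriminator probabilities, not just losses. A small sketch (the logits below are illustrative values from a hypothetical training step):

```python
import torch

real_logits = torch.tensor([1.4, 2.0, 0.9, 1.7])    # D's logits on real batch
fake_logits = torch.tensor([-1.2, -0.4, -0.9, -0.2])  # D's logits on fakes

# Convert logits to probabilities and average over the batch.
d_real = torch.sigmoid(real_logits).mean().item()
d_fake = torch.sigmoid(fake_logits).mean().item()

# Healthy ranges from the chapter: D(real) ~ 0.7-0.9, D(fake) drifting
# toward 0.3-0.5 as the generator improves.
if d_real > 0.95 and d_fake < 0.05:
    print("discriminator too strong: consider fewer D updates or a lower D lr")
```

Logging these two numbers every few hundred steps, alongside periodic sample grids, catches failure modes that a flat loss curve hides.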