Chapter 27: Exercises — Diffusion Models and Image Generation
Conceptual Exercises
Exercise 1: Forward Process Derivation
Starting from the Markov transition $q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t\mathbf{I})$, derive the closed-form expression $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$ by induction. Show all intermediate steps.
Exercise 2: Signal-to-Noise Ratio Analysis
For a linear schedule with $\beta_1 = 10^{-4}$, $\beta_T = 0.02$, and $T = 1000$: (a) Plot $\bar{\alpha}_t$ as a function of $t$. (b) Plot the signal-to-noise ratio $\text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t)$ on a logarithmic scale. (c) At what timestep does the SNR equal 1 (equal parts signal and noise)?
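A plotting sketch to get started (PyTorch and matplotlib assumed; the schedule is built directly from the given constants):

```python
import torch
import matplotlib.pyplot as plt

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear beta schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)       # \bar{alpha}_t
snr = alpha_bars / (1.0 - alpha_bars)           # signal-to-noise ratio

t = torch.arange(1, T + 1)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(t, alpha_bars)
ax1.set(xlabel="t", ylabel=r"$\bar{\alpha}_t$")
ax2.semilogy(t, snr)
ax2.set(xlabel="t", ylabel="SNR(t)")
plt.tight_layout()
plt.show()

# (c) first timestep at which the SNR drops below 1
crossing = int((snr < 1.0).nonzero()[0]) + 1
print(f"SNR crosses 1 near t = {crossing}")
```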
Exercise 3: Posterior Mean Derivation
Derive the posterior mean $\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)$ by applying Bayes' rule to $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \propto q(\mathbf{x}_t | \mathbf{x}_{t-1}) q(\mathbf{x}_{t-1} | \mathbf{x}_0)$ and completing the square to obtain the Gaussian form.
Exercise 4: Noise Prediction vs. x0 Prediction
Show algebraically that predicting the noise $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ is equivalent to predicting the clean image $\hat{\mathbf{x}}_0(\mathbf{x}_t, t)$. Express the conversion formula in both directions.
Exercise 5: DDIM Deterministic Mapping
Explain why DDIM with $\sigma_t = 0$ creates a deterministic mapping from noise to image. What are the practical implications for (a) latent space interpolation and (b) image editing via inversion?
Exercise 6: Cosine vs. Linear Schedule
Compare the cosine and linear noise schedules. Plot $\bar{\alpha}_t$ for both schedules with $T = 1000$. Over which range of timesteps does the cosine schedule retain significantly more signal than the linear schedule? Why does this improve generation quality?
Exercise 7: Latent Diffusion Compression
A Stable Diffusion VAE compresses $512 \times 512 \times 3$ images to $64 \times 64 \times 4$ latents. (a) Calculate the compression ratio. (b) Calculate the theoretical minimum bits per pixel for lossless compression, assuming a uniform distribution. (c) Explain why 4 latent channels (rather than 3) are used.
Exercise 8: Classifier-Free Guidance Analysis
For classifier-free guidance with scale $w$, the effective noise prediction is $\tilde{\boldsymbol{\epsilon}} = (1 - w)\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing) + w \cdot \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})$. (a) What happens when $w = 0$? (b) When $w = 1$? (c) Why does $w > 1$ improve text alignment? (d) Why does excessive $w$ cause artifacts?
Exercise 9: ControlNet Zero Initialization
Explain why zero initialization of ControlNet's connection layers is critical. What would happen if these layers were randomly initialized? How does this relate to the concept of "harmless initialization" in fine-tuning?
Exercise 10: Generative Model Comparison
Create a detailed comparison table of GANs, VAEs, Normalizing Flows, Autoregressive Models, and Diffusion Models across the following dimensions: training stability, sample quality, diversity, likelihood computation, sampling speed, and controllability.
Implementation Exercises
Exercise 11: Noise Schedule Implementation
Implement linear, cosine, and sigmoid noise schedules in PyTorch. For each schedule, compute and plot $\beta_t$, $\alpha_t$, $\bar{\alpha}_t$, and $\text{SNR}(t)$.
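A possible starting point for the three schedules (the cosine schedule follows the Nichol and Dhariwal construction; the sigmoid schedule has several formulations in the literature, and the one below is one common choice):

```python
import torch

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008):
    # Define alpha_bar via a squared cosine, then recover betas from consecutive ratios.
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()

def sigmoid_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    # One common variant: pass an evenly spaced grid through a sigmoid.
    x = torch.linspace(-6, 6, T)
    return torch.sigmoid(x) * (beta_end - beta_start) + beta_start

def derived_quantities(betas):
    # Returns alpha_t, alpha_bar_t, and SNR(t) for plotting.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    snr = alpha_bars / (1.0 - alpha_bars)
    return alphas, alpha_bars, snr
```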
Exercise 12: Forward Diffusion Visualization
Write code that takes a sample image and visualizes the forward diffusion process at timesteps $t = 0, 100, 250, 500, 750, 1000$. Display the noisy images and their pixel histograms.
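A minimal sketch, assuming a linear schedule and a placeholder image tensor (swap in a real image scaled to $[-1, 1]$):

```python
import torch
import matplotlib.pyplot as plt

betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1 - betas, dim=0)

def q_sample(x0, t, noise=None):
    # Closed-form sample from q(x_t | x_0); t is a 0-based index into alpha_bars.
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

image = torch.rand(3, 64, 64) * 2 - 1           # placeholder; load a real image here
timesteps = [0, 100, 250, 500, 750, 1000]
fig, axes = plt.subplots(2, len(timesteps), figsize=(3 * len(timesteps), 6))
for col, t in enumerate(timesteps):
    xt = image if t == 0 else q_sample(image, t - 1)
    axes[0, col].imshow(((xt + 1) / 2).clamp(0, 1).permute(1, 2, 0))
    axes[0, col].set_title(f"t = {t}")
    axes[0, col].axis("off")
    axes[1, col].hist(xt.flatten().numpy(), bins=50)
plt.tight_layout()
plt.show()
```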
Exercise 13: Simple Denoising Network
Implement a small U-Net (4 resolution levels, 2 residual blocks each) with sinusoidal timestep embeddings. Train it to denoise MNIST images with a fixed noise level $\sigma = 0.5$.
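A sketch of the fixed-noise training loop; the small convolutional stack is only a stand-in for the U-Net the exercise asks you to build:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

sigma = 0.5
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])  # scale to [-1, 1]
loader = DataLoader(datasets.MNIST("data", train=True, download=True,
                                   transform=transform),
                    batch_size=128, shuffle=True)

model = torch.nn.Sequential(                    # stand-in for your small U-Net
    torch.nn.Conv2d(1, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

for epoch in range(10):
    for x, _ in loader:
        noisy = x + sigma * torch.randn_like(x)             # fixed noise level
        loss = torch.nn.functional.mse_loss(model(noisy), x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```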
Exercise 14: DDPM Training Loop
Implement the complete DDPM training loop: random timestep sampling, noise addition, noise prediction, and loss computation. Train on CIFAR-10 for 50 epochs and visualize generated samples.
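The core training step might look like the following sketch, assuming `model(x_t, t)` predicts the added noise and `alpha_bars` comes from Exercise 11:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alpha_bars):
    # One DDPM training step: uniform timestep, closed-form noising, noise-prediction MSE.
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,), device=x0.device)
    noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(B, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
    return F.mse_loss(model(x_t, t), noise)

# Training loop skeleton (model, data loader, and schedule assumed from earlier exercises):
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# for epoch in range(50):
#     for x0, _ in loader:
#         optimizer.zero_grad()
#         ddpm_loss(model, x0, alpha_bars).backward()
#         optimizer.step()
```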
Exercise 15: DDIM Sampler
Implement the DDIM sampling algorithm with configurable $\sigma_t$ and number of steps. Compare 1000-step DDPM, 50-step DDIM ($\sigma=0$), and 50-step DDIM ($\sigma > 0$) on the same trained model.
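A sketch of the DDIM update rule, again assuming a noise-predicting `model(x_t, t)` and precomputed `alpha_bars`; $\eta = 0$ recovers the deterministic sampler:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bars, num_steps=50, eta=0.0, device="cpu"):
    # eta scales the per-step noise; eta = 0 is deterministic DDIM.
    T = len(alpha_bars)
    times = torch.linspace(T - 1, 0, num_steps).long()   # sub-sequence of timesteps
    x = torch.randn(shape, device=device)
    for i, t in enumerate(times):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[times[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, t.repeat(shape[0]).to(device))
        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        sigma = eta * ((1 - ab_prev) / (1 - ab_t)).sqrt() * (1 - ab_t / ab_prev).sqrt()
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev - sigma**2).sqrt() * eps
        if sigma > 0:
            x = x + sigma * torch.randn_like(x)
    return x
```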
Exercise 16: Timestep Embedding Module
Implement sinusoidal timestep embeddings and a two-layer MLP projection. Verify that nearby timesteps produce similar embeddings and distant timesteps produce different embeddings.
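One way to structure the module (the similarity check is run on the raw sinusoidal features, since the MLP is untrained):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim):
    # Transformer-style sin/cos features of the (integer) timestep.
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimestepEmbedding(nn.Module):
    def __init__(self, dim, hidden_dim=None):
        super().__init__()
        self.dim = dim
        hidden_dim = hidden_dim or 4 * dim
        self.mlp = nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self, t):
        return self.mlp(sinusoidal_embedding(t, self.dim))

# Sanity check: nearby timesteps should be more similar than distant ones.
e = sinusoidal_embedding(torch.tensor([10, 12, 500]), 128)
print(torch.cosine_similarity(e[0], e[1], dim=0))   # close to 1
print(torch.cosine_similarity(e[0], e[2], dim=0))   # noticeably smaller
```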
Exercise 17: Cross-Attention for Conditioning
Implement a cross-attention layer that conditions U-Net features on text embeddings. Use random text embeddings to verify the attention mechanism works correctly by checking output shapes and gradient flow.
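A compact multi-head cross-attention sketch with the requested shape and gradient checks (the dimensions mirror Stable Diffusion's 320-channel features and 77-token text context, but any values work):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    # Queries come from image features, keys/values from text embeddings.
    def __init__(self, dim, context_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(context_dim, dim, bias=False)
        self.to_v = nn.Linear(context_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, context):              # x: [B, N, dim], context: [B, M, context_dim]
        B, N, D = x.shape
        h, d = self.heads, D // self.heads
        q = self.to_q(x).view(B, N, h, d).transpose(1, 2)
        k = self.to_k(context).view(B, -1, h, d).transpose(1, 2)
        v = self.to_v(context).view(B, -1, h, d).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.to_out(out)

# Shape / gradient-flow check with random "text" embeddings.
layer = CrossAttention(dim=320, context_dim=768)
feats = torch.randn(2, 64 * 64, 320, requires_grad=True)
text = torch.randn(2, 77, 768)
out = layer(feats, text)
out.sum().backward()
print(out.shape, feats.grad is not None)        # torch.Size([2, 4096, 320]) True
```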
Exercise 18: VAE Encoder-Decoder
Implement a simple convolutional VAE with KL regularization. Train on CIFAR-10 and measure reconstruction quality (MSE, PSNR) as a function of latent dimension.
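The loss, reparameterization, and PSNR pieces might look like this sketch; the encoder and decoder architectures are left to you:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps so gradients flow through mu and logvar.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def vae_loss(x, x_recon, mu, logvar, kl_weight=1e-3):
    # Reconstruction term plus the closed-form KL to a standard normal prior.
    recon = F.mse_loss(x_recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl

def psnr(x, x_recon, max_val=1.0):
    # Peak signal-to-noise ratio in dB for images scaled to [0, max_val].
    return 10 * torch.log10(max_val**2 / F.mse_loss(x_recon, x))
```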
Exercise 19: FID Score Computation
Implement FID score computation using a pre-trained InceptionV3 network. Compute FID between: (a) two splits of CIFAR-10 training data (should be low), (b) CIFAR-10 and random noise (should be high).
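Assuming you have already extracted the 2048-dimensional InceptionV3 pool features for both image sets (e.g., via torchvision's pretrained InceptionV3), the FID statistic itself reduces to a few lines of NumPy/SciPy:

```python
import numpy as np
from scipy import linalg

def fid_from_activations(act1, act2):
    # act1, act2: [N, 2048] arrays of InceptionV3 pool features for the two image sets.
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    sigma1 = np.cov(act1, rowvar=False)
    sigma2 = np.cov(act2, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real          # numerical noise can produce tiny imaginary parts
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)
```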
Exercise 20: Guidance Scale Sweep
Using a pre-trained Stable Diffusion model, generate images with the same prompt and seed at guidance scales $w = 1, 3, 5, 7.5, 10, 15, 20$. Display the results and compute CLIP scores for each.
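A sweep sketch using the diffusers library; the checkpoint name and prompt are placeholders, and CLIP scoring can be added afterwards with any CLIP implementation:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

prompt = "a red bicycle leaning against a brick wall"    # placeholder prompt
images = {}
for w in [1, 3, 5, 7.5, 10, 15, 20]:
    generator = torch.Generator("cuda").manual_seed(0)   # same seed at every scale
    images[w] = pipe(prompt, guidance_scale=w, generator=generator).images[0]
    images[w].save(f"guidance_{w}.png")
```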
Applied Exercises
Exercise 21: DreamBooth Fine-Tuning
Fine-tune a Stable Diffusion model using DreamBooth on 5-10 images of a specific object. Generate the object in novel contexts and evaluate identity preservation.
Exercise 22: LoRA for Style Transfer
Train a LoRA adapter on a dataset of images in a specific artistic style. Compare the computational cost and quality with full model fine-tuning.
Exercise 23: Inpainting Pipeline
Build an inpainting pipeline that takes an image and a text description of the region to edit, generates the corresponding mask, and fills that region using Stable Diffusion inpainting. Evaluate on multiple examples.
Exercise 24: Image-to-Image Translation
Implement the SDEdit (image-to-image) pipeline with configurable strength. Generate variations of input images at different strength levels and analyze the quality-fidelity tradeoff.
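A strength-sweep sketch with the diffusers img2img pipeline; the checkpoint, input path, and prompt are placeholders:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

init = Image.open("input.jpg").convert("RGB").resize((512, 512))   # placeholder path
prompt = "a watercolor painting of the same scene"                 # placeholder prompt
for strength in [0.2, 0.4, 0.6, 0.8]:
    generator = torch.Generator("cuda").manual_seed(0)
    out = pipe(prompt, image=init, strength=strength,
               generator=generator).images[0]
    out.save(f"sdedit_strength_{strength}.png")
```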
Exercise 25: ControlNet Edge-to-Image
Use a pre-trained ControlNet with Canny edge conditioning to generate images from edge maps. Extract edges from real photographs, generate new images from those edges, and compare with the originals.
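A sketch using the publicly available Canny ControlNet weights with diffusers and OpenCV; the file path, Canny thresholds, and prompt are placeholders:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Extract a Canny edge map from a photograph and replicate it to 3 channels.
gray = np.array(Image.open("photo.jpg").convert("L").resize((512, 512)))
edges = cv2.Canny(gray, 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

out = pipe("a photograph of a modern living room",      # placeholder prompt
           image=edge_image).images[0]
out.save("controlnet_from_edges.png")
```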
Challenge Exercises
Exercise 26: Unconditional DDPM from Scratch
Train an unconditional DDPM from scratch on CelebA-HQ at $256 \times 256$ resolution. Implement the full pipeline including the U-Net, training loop, sampling, and FID evaluation. Target an FID below 30.
Exercise 27: Latent Diffusion Model
Implement a complete latent diffusion pipeline: train a VAE, then train a diffusion model in latent space. Compare generation quality with pixel-space diffusion at equivalent training compute.
Exercise 28: Multi-Step Distillation
Implement progressive distillation for your trained diffusion model. Start with a 1000-step teacher and progressively halve the number of steps to obtain a 4-step student. Measure FID at each distillation stage.
Exercise 29: Custom ControlNet Training
Train a ControlNet for a novel conditioning modality (e.g., color palette, text layout, or sketch). Collect or generate 10K+ conditioning-image pairs and train on a pre-trained Stable Diffusion base.
Exercise 30: Video Generation Extension
Extend a trained 2D diffusion model to generate short video clips by adding temporal attention layers. Generate 16-frame clips and evaluate temporal consistency using optical flow metrics.