Chapter 27: Exercises — Diffusion Models and Image Generation

Conceptual Exercises

Exercise 1: Forward Process Derivation

Starting from the Markov transition $q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t\mathbf{I})$, derive the closed-form expression $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$ by induction, where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Show all intermediate steps.
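
A useful intermediate identity for the inductive step (a hint, not the full derivation), using the reparameterization $\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1}$ with $\boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$: composing two steps gives

$$\mathbf{x}_t = \sqrt{\alpha_t \alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\boldsymbol{\epsilon}_{t-2} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} = \sqrt{\alpha_t \alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar{\boldsymbol{\epsilon}},$$

where the two independent Gaussian terms merge into a single $\bar{\boldsymbol{\epsilon}} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ because their variances add: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\alpha_{t-1}$.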

Exercise 2: Signal-to-Noise Ratio Analysis

For a linear schedule with $\beta_1 = 10^{-4}$, $\beta_T = 0.02$, and $T = 1000$: (a) Plot $\bar{\alpha}_t$ as a function of $t$. (b) Plot the signal-to-noise ratio $\text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t)$ on a log scale. (c) At what timestep does the SNR equal 1 (equal parts signal and noise)?
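
A minimal starting sketch in PyTorch for parts (a)-(c); plotting (e.g., with matplotlib) is left to you, and the `+ 1` simply converts a zero-based index back to a timestep in $1, \dots, T$:

```python
# Minimal sketch for Exercise 2, assuming the linear beta schedule given above.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_1 ... beta_T
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)       # \bar{alpha}_t
snr = alpha_bar / (1.0 - alpha_bar)            # SNR(t)

# Timestep closest to SNR = 1 (equal signal and noise)
t_equal = torch.argmin((snr - 1.0).abs()).item() + 1
print(f"SNR crosses 1 near t = {t_equal}")
```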

Exercise 3: Posterior Mean Derivation

Derive the posterior mean $\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)$ by applying Bayes' rule to $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \propto q(\mathbf{x}_t | \mathbf{x}_{t-1}) q(\mathbf{x}_{t-1} | \mathbf{x}_0)$ and completing the square to obtain the Gaussian form.

Exercise 4: Noise Prediction vs. $\mathbf{x}_0$ Prediction

Show algebraically that predicting the noise $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ is equivalent to predicting the clean image $\hat{\mathbf{x}}_0(\mathbf{x}_t, t)$. Express the conversion formula in both directions.

Exercise 5: DDIM Deterministic Mapping

Explain why DDIM with $\sigma_t = 0$ creates a deterministic mapping from noise to image. What are the practical implications for (a) latent space interpolation and (b) image editing via inversion?

Exercise 6: Cosine vs. Linear Schedule

Compare the cosine and linear noise schedules. Plot $\bar{\alpha}_t$ for both schedules with $T = 1000$. At which timestep range does the cosine schedule retain significantly more signal than the linear schedule? Why does this improve generation quality?

Exercise 7: Latent Diffusion Compression

A Stable Diffusion VAE compresses $512 \times 512 \times 3$ images to $64 \times 64 \times 4$ latents. Calculate: (a) the compression ratio, (b) the theoretical minimum bits per pixel for lossless compression assuming a uniform pixel distribution, and (c) why 4 latent channels (not 3) are used.

Exercise 8: Classifier-Free Guidance Analysis

For classifier-free guidance with scale $w$, the effective noise prediction is $\tilde{\boldsymbol{\epsilon}} = (1 - w)\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing) + w \cdot \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})$. (a) What happens when $w = 0$? (b) When $w = 1$? (c) Why does $w > 1$ improve text alignment? (d) Why does excessive $w$ cause artifacts?
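
A minimal sketch of the guided combination as defined in this exercise; `eps_uncond` and `eps_cond` are placeholders for the unconditional and conditional model outputs:

```python
# Sketch of the classifier-free guidance combination used in Exercise 8.
import torch

def guided_noise(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, w: float) -> torch.Tensor:
    # (1 - w) * eps(empty) + w * eps(c); equivalently eps(empty) + w * (eps(c) - eps(empty))
    return (1.0 - w) * eps_uncond + w * eps_cond
```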

Exercise 9: ControlNet Zero Initialization

Explain why zero initialization of ControlNet's connection layers is critical. What would happen if these layers were randomly initialized? How does this relate to the concept of "harmless initialization" in fine-tuning?

Exercise 10: Generative Model Comparison

Create a detailed comparison table of GANs, VAEs, Normalizing Flows, Autoregressive Models, and Diffusion Models across the following dimensions: training stability, sample quality, diversity, likelihood computation, sampling speed, and controllability.

Implementation Exercises

Exercise 11: Noise Schedule Implementation

Implement linear, cosine, and sigmoid noise schedules in PyTorch. For each schedule, compute and plot $\beta_t$, $\alpha_t$, $\bar{\alpha}_t$, and $\text{SNR}(t)$.
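
A possible starting sketch for the linear and cosine schedules (the sigmoid variant is left to you); the cosine schedule follows the $\bar{\alpha}_t$-first construction, recovering $\beta_t$ from consecutive ratios:

```python
# Starting sketch for Exercise 11: linear and cosine schedules plus derived quantities.
import math
import torch

def linear_betas(T: int, beta_1: float = 1e-4, beta_T: float = 0.02) -> torch.Tensor:
    return torch.linspace(beta_1, beta_T, T)

def cosine_betas(T: int, s: float = 0.008) -> torch.Tensor:
    # Cosine schedule: define alpha_bar directly, then recover betas from ratios.
    steps = torch.arange(T + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()

def schedule_stats(betas: torch.Tensor):
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    snr = alpha_bar / (1.0 - alpha_bar)
    return alphas, alpha_bar, snr
```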

Exercise 12: Forward Diffusion Visualization

Write code that takes a sample image and visualizes the forward diffusion process at timesteps $t = 0, 100, 250, 500, 750, 1000$. Display the noisy images and their pixel histograms.
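
A minimal sketch of the forward-noising operation the visualization needs, treating `t` as a zero-based index into a precomputed `alpha_bar` tensor:

```python
# Sketch of the forward noising step for Exercise 12; plotting and histograms are left to you.
import torch

def q_sample(x0: torch.Tensor, t: int, alpha_bar: torch.Tensor) -> torch.Tensor:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    eps = torch.randn_like(x0)
    ab = alpha_bar[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
```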

Exercise 13: Simple Denoising Network

Implement a small U-Net (4 resolution levels, 2 residual blocks each) with sinusoidal timestep embeddings. Train it to denoise MNIST images with a fixed noise level $\sigma = 0.5$.

Exercise 14: DDPM Training Loop

Implement the complete DDPM training loop: random timestep sampling, noise addition, noise prediction, and loss computation. Train on CIFAR-10 for 50 epochs and visualize generated samples.
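
A sketch of the core training step under the assumption that `model(x_t, t)` predicts the added noise; the data loading, optimizer, and epoch loop are left to you:

```python
# Sketch of a single DDPM training step (Exercise 14).
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)       # random timestep per sample
    eps = torch.randn_like(x0)
    ab = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                     # forward noising
    eps_pred = model(x_t, t)                                         # predict the added noise
    return F.mse_loss(eps_pred, eps)
```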

Exercise 15: DDIM Sampler

Implement the DDIM sampling algorithm with configurable $\sigma_t$ and number of steps. Compare 1000-step DDPM, 50-step DDIM ($\sigma=0$), and 50-step DDIM ($\sigma > 0$) on the same trained model.
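
A sketch of a single DDIM update, assuming `ab_t` and `ab_prev` are the scalar tensors $\bar{\alpha}_t$ and $\bar{\alpha}_{t_{\text{prev}}}$ for the current and next (earlier) timesteps:

```python
# Sketch of one DDIM update step (Exercise 15); sigma = 0 gives the deterministic sampler.
import torch

def ddim_step(x_t, eps_pred, ab_t, ab_prev, sigma: float = 0.0):
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps_pred) / ab_t.sqrt()     # predicted clean image
    dir_xt = (1 - ab_prev - sigma**2).sqrt() * eps_pred              # direction pointing back toward x_t
    noise = sigma * torch.randn_like(x_t) if sigma > 0 else 0.0
    return ab_prev.sqrt() * x0_pred + dir_xt + noise
```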

Exercise 16: Timestep Embedding Module

Implement sinusoidal timestep embeddings and a two-layer MLP projection. Verify that nearby timesteps produce similar embeddings and distant timesteps produce different embeddings.
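
A sketch of the sinusoidal embedding (transformer-style, assuming an even `dim`); the two-layer MLP projection and the similarity checks are left to you:

```python
# Sketch of sinusoidal timestep embeddings for Exercise 16.
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int, max_period: float = 10000.0) -> torch.Tensor:
    # t: (batch,) integer timesteps; returns (batch, dim) embeddings.
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs.to(t.device)[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```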

Exercise 17: Cross-Attention for Conditioning

Implement a cross-attention layer that conditions U-Net features on text embeddings. Use random text embeddings to verify the attention mechanism works correctly by checking output shapes and gradient flow.
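
A minimal cross-attention sketch built on `nn.MultiheadAttention` (one of several reasonable implementations); `query_dim` must be divisible by `heads`:

```python
# Sketch of a cross-attention block for Exercise 17: image features attend to text embeddings.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, query_dim: int, context_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(query_dim, heads,
                                          kdim=context_dim, vdim=context_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(query_dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_pixels, query_dim) flattened spatial features
        # context: (batch, num_tokens, context_dim) text embeddings
        out, _ = self.attn(self.norm(x), context, context)
        return x + out                                               # residual connection
```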

Exercise 18: VAE Encoder-Decoder

Implement a simple convolutional VAE with KL regularization. Train on CIFAR-10 and measure reconstruction quality (MSE, PSNR) as a function of latent dimension.
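
A sketch of the per-batch VAE objective (reconstruction plus KL to a unit Gaussian); the encoder/decoder architecture and the `beta` weight are yours to choose:

```python
# Sketch of the VAE loss for Exercise 18.
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta: float = 1.0):
    recon = F.mse_loss(x_recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL(q(z|x) || N(0, I))
    return recon + beta * kl
```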

Exercise 19: FID Score Computation

Implement FID score computation using a pre-trained InceptionV3 network. Compute FID between: (a) two splits of CIFAR-10 training data (should be low), (b) CIFAR-10 and random noise (should be high).
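
A sketch of the Fréchet distance between two Gaussians fitted to Inception features; extracting the features and estimating `mu`/`sigma` from them is left to you:

```python
# Sketch of the Frechet distance used by FID (Exercise 19).
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)           # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                                       # drop tiny imaginary parts
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```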

Exercise 20: Guidance Scale Sweep

Using a pre-trained Stable Diffusion model, generate images with the same prompt and seed at guidance scales $w = 1, 3, 5, 7.5, 10, 15, 20$. Display the results and compute CLIP scores for each.
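
A possible sweep loop, assuming the Hugging Face diffusers library and the `runwayml/stable-diffusion-v1-5` checkpoint (substitute whatever model you have access to); CLIP scoring is left to you:

```python
# Sketch of the guidance-scale sweep for Exercise 20.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

prompt = "a red bicycle leaning against a brick wall"
images = []
for w in [1, 3, 5, 7.5, 10, 15, 20]:
    generator = torch.Generator("cuda").manual_seed(0)               # same seed for every scale
    image = pipe(prompt, guidance_scale=w, generator=generator).images[0]
    images.append((w, image))
```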

Applied Exercises

Exercise 21: DreamBooth Fine-Tuning

Fine-tune a Stable Diffusion model using DreamBooth on 5-10 images of a specific object. Generate the object in novel contexts and evaluate identity preservation.

Exercise 22: LoRA for Style Transfer

Train a LoRA adapter on a dataset of images in a specific artistic style. Compare the computational cost and quality with full model fine-tuning.

Exercise 23: Inpainting Pipeline

Build an inpainting pipeline that takes an image and a text description of the region to edit, generates a mask for that region, and fills it using Stable Diffusion inpainting. Evaluate on multiple examples.

Exercise 24: Image-to-Image Translation

Implement the SDEdit (image-to-image) pipeline with configurable strength. Generate variations of input images at different strength levels and analyze the quality-fidelity tradeoff.
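
A possible strength sweep, assuming diffusers' `StableDiffusionImg2ImgPipeline`; `init_image` is a placeholder for a PIL image you load yourself:

```python
# Sketch of the SDEdit / img2img strength sweep for Exercise 24.
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

for strength in [0.2, 0.4, 0.6, 0.8]:
    out = pipe(prompt="a watercolor painting of the same scene",
               image=init_image, strength=strength).images[0]
    out.save(f"sdedit_strength_{strength}.png")
```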

Exercise 25: ControlNet Edge-to-Image

Use a pre-trained ControlNet with Canny edge conditioning to generate images from edge maps. Extract edges from real photographs, generate new images from those edges, and compare with the originals.

Challenge Exercises

Exercise 26: Unconditional DDPM from Scratch

Train an unconditional DDPM from scratch on CelebA-HQ at $256 \times 256$ resolution. Implement the full pipeline including the U-Net, training loop, sampling, and FID evaluation. Target FID < 30.

Exercise 27: Latent Diffusion Model

Implement a complete latent diffusion pipeline: train a VAE, then train a diffusion model in latent space. Compare generation quality with pixel-space diffusion at equivalent training compute.

Exercise 28: Progressive Distillation

Implement progressive distillation for your trained diffusion model. Start with a 1000-step teacher and progressively halve the number of steps to obtain a 4-step student. Measure FID at each distillation stage.

Exercise 29: Custom ControlNet Training

Train a ControlNet for a novel conditioning modality (e.g., color palette, text layout, or sketch). Collect or generate 10K+ conditioning-image pairs and train on a pre-trained Stable Diffusion base.

Exercise 30: Video Generation Extension

Extend a trained 2D diffusion model to generate short video clips by adding temporal attention layers. Generate 16-frame clips and evaluate temporal consistency using optical flow metrics.