Chapter 27: Quiz — Diffusion Models and Image Generation
Test your understanding of diffusion models with these questions. Try to answer each question before revealing the solution.
Question 1
What is the simplified training objective for DDPM, and why is it effective?
Show Answer
The simplified training objective is $\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2]$, which trains the neural network to predict the noise $\boldsymbol{\epsilon}$ that was added to create $\mathbf{x}_t$ from $\mathbf{x}_0$. It is effective because: (1) predicting noise at each timestep is a well-conditioned regression problem, (2) the objective naturally decomposes the complex generation task into many simple denoising steps, (3) it is equivalent to a weighted variational bound on the data log-likelihood, and (4) the uniform timestep sampling ensures the network learns to denoise at all noise levels.
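As a concrete illustration, here is a minimal PyTorch sketch of this objective; the `model(x_t, t)` noise predictor and the `alpha_bar` tensor are hypothetical placeholders, not a specific library API:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alpha_bar):
    """L_simple: MSE between the true noise and the predicted noise at a random t."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (b,), device=x0.device)  # t ~ Uniform(1, T)
    eps = torch.randn_like(x0)                         # noise the model must recover
    ab = alpha_bar.to(x0.device)[t].view(b, 1, 1, 1)   # \bar{alpha}_t per sample
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps       # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)
```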
Question 2
How does the forward diffusion process enable efficient training without iterating through all timesteps?
Show Answer
The forward process has a closed-form expression: $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$, which means $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ where $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$. This allows us to sample $\mathbf{x}_t$ at any arbitrary timestep $t$ directly from $\mathbf{x}_0$ in a single step, without computing intermediate $\mathbf{x}_1, \ldots, \mathbf{x}_{t-1}$. During training, we randomly sample $t \sim \text{Uniform}(1, T)$ and compute $\mathbf{x}_t$ directly, making each training iteration O(1) regardless of $T$.
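A short sketch of that one-step sampling, with a linear schedule used purely for illustration:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # illustrative linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) directly, without visiting x_1, ..., x_{t-1}."""
    ab = alpha_bar[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

x0 = torch.randn(3, 32, 32)                      # stand-in for an image
x_700 = q_sample(x0, 700, torch.randn_like(x0))  # O(1) regardless of t
```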
Question 3
Why does the cosine noise schedule improve generation quality compared to the linear schedule?
Show Answer
The linear schedule drops the signal-to-noise ratio too rapidly in the early timesteps, meaning that a disproportionate number of timesteps correspond to nearly pure noise where the model cannot learn meaningful denoising. The cosine schedule provides a more gradual and uniform progression of the log-SNR across timesteps, ensuring that more timesteps correspond to intermediate noise levels where the model learns the most useful denoising operations. In particular, the cosine schedule retains more signal in the middle timesteps ($t = 300$-$700$), which correspond to the noise levels where the model makes critical decisions about global structure and composition.
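The sketch below follows the $\bar{\alpha}_t = f(t/T)/f(0)$ construction of Nichol & Dhariwal (2021), and comparing the log-SNR at a few mid-trajectory timesteps makes the difference visible:

```python
import math
import torch

def cosine_alpha_bar(T, s=0.008):
    """\bar{alpha}_t from a squared cosine f(t/T), normalized so \bar{alpha}_0 is ~1."""
    u = torch.arange(T + 1) / T
    f = torch.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return f[1:] / f[0]

def log_snr(ab):
    return torch.log(ab / (1 - ab))

linear_ab = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
cosine_ab = cosine_alpha_bar(1000)
for t in (300, 500, 700):  # the mid-trajectory region discussed above
    print(t, log_snr(linear_ab[t]).item(), log_snr(cosine_ab[t]).item())
```

The cosine values stay much closer to zero log-SNR in this range, i.e., more signal survives at the timesteps where global structure is decided.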
Question 4
What is the fundamental difference between DDPM and DDIM sampling, and why can DDIM use fewer steps?
Show Answer
DDPM defines a stochastic Markov chain for the reverse process, where each step adds random noise. DDIM defines a non-Markovian process that shares the same marginal distributions $q(\mathbf{x}_t | \mathbf{x}_0)$ as DDPM but can be made deterministic (when $\sigma_t = 0$). DDIM can use fewer steps because: (1) the non-Markovian formulation allows skipping timesteps, so we can go from $\mathbf{x}_{t}$ to $\mathbf{x}_{t-k}$ directly using only the marginals, (2) the deterministic mapping produces a smooth trajectory in latent space that can be traversed with larger steps, and (3) the implicit model can be viewed as an ODE solver, which can use adaptive step sizes. With 50 steps, DDIM produces quality comparable to 1000-step DDPM.
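A sketch of the deterministic ($\sigma_t = 0$) DDIM update, reusing the hypothetical `model` and `alpha_bar` names from the earlier snippets; note that `t_prev` can be many steps below `t`:

```python
def ddim_step(model, x_t, t, t_prev, alpha_bar):
    """One deterministic DDIM update, jumping from timestep t to t_prev < t."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = model(x_t, t)                                      # predicted noise
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()  # implied clean sample
    return ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
```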
Question 5
Explain the relationship between noise prediction ($\boldsymbol{\epsilon}$-prediction) and score matching. How are they mathematically connected?
Show Answer
They are equivalent up to a scaling factor. The conditional score at timestep $t$ is $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar{\alpha}_t}}$, where $\boldsymbol{\epsilon}$ is the noise that was added to create $\mathbf{x}_t$; matching it in expectation over $\mathbf{x}_0$ also matches the marginal score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$. Therefore, a noise prediction network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ is equivalent to a score network $\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$. The DDPM training objective (noise prediction MSE) is proportional to the denoising score matching objective. This connection, formalized by Song et al. (2021) in the SDE framework, unifies the DDPM and score-based perspectives.
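In code the conversion is a single rescaling (a trivial sketch; `eps_pred` and `alpha_bar_t` are assumed inputs):

```python
def score_from_eps(eps_pred, alpha_bar_t):
    """s_theta(x_t, t) = -eps_theta(x_t, t) / sqrt(1 - \bar{alpha}_t)."""
    return -eps_pred / (1 - alpha_bar_t) ** 0.5
```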
Question 6
What are the three main components of the Stable Diffusion architecture, and what role does each play?
Show Answer
1. **VAE (Variational Autoencoder)**: Compresses images from pixel space (512x512x3) to a latent space (64x64x4) and back. The encoder maps images to compact latent representations; the decoder reconstructs images from latents. This compression makes diffusion computationally tractable.
2. **U-Net with cross-attention**: The core denoising network that predicts noise in the latent space. It receives the noisy latent, the timestep, and text embeddings. Self-attention models spatial relationships within the image; cross-attention layers enable text conditioning by allowing spatial features to attend to text embeddings.
3. **CLIP Text Encoder**: Converts text prompts into a sequence of embedding vectors (77x768). These embeddings serve as keys and values in the U-Net's cross-attention layers, guiding the generation process to produce images matching the text description.
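With the `diffusers` library, the three components are exposed as attributes of the pipeline; the checkpoint ID below is illustrative and may need substituting with whatever SD 1.x checkpoint you have access to:

```python
# pip install diffusers transformers
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
print(type(pipe.vae).__name__)           # AutoencoderKL: pixels <-> 64x64x4 latents
print(type(pipe.unet).__name__)          # UNet2DConditionModel: denoiser with cross-attention
print(type(pipe.text_encoder).__name__)  # CLIPTextModel: prompt -> 77x768 embeddings
```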
Question 7
How does classifier-free guidance work, and why does it require two forward passes per denoising step?
Show Answer
Classifier-free guidance amplifies the effect of text conditioning during sampling. The model is trained with random condition dropout (10-20% of the time, the text condition is replaced with a null embedding). At inference, for each denoising step, two predictions are made: one conditioned on the text ($\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})$) and one unconditional ($\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing)$). The guided prediction extrapolates away from the unconditional toward the conditional: $\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing) + w \cdot (\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}) - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing))$. This requires two forward passes because both the conditional and unconditional noise estimates must be computed separately by the U-Net. In practice, they are batched together for efficiency, but the total FLOPs still double.
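A sketch of a guided step, assuming a generic `model(z, t, c)` interface rather than any particular library's signature; the batching trick mentioned above appears in the `torch.cat` calls:

```python
import torch

def cfg_eps(model, z_t, t, cond, uncond, w=7.5):
    """Classifier-free guidance: two noise estimates folded into one batched call."""
    z_in = torch.cat([z_t, z_t])                  # duplicate the latents
    c_in = torch.cat([uncond, cond])              # null embedding + text embedding
    eps_u, eps_c = model(z_in, t, c_in).chunk(2)  # one pass, two predictions
    return eps_u + w * (eps_c - eps_u)            # extrapolate toward the condition
```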
Question 8
Why does latent diffusion dramatically reduce computational cost compared to pixel-space diffusion?
Show Answer
Latent diffusion operates in a compressed space that is 48x smaller than pixel space (64x64x4 = 16,384 values vs. 512x512x3 = 786,432 values). Since the U-Net's computational cost scales super-linearly with spatial resolution (due to attention layers with quadratic complexity), this reduction yields approximately 16-50x savings in FLOPs per denoising step. The key insight is that most of the perceptual information in images is contained in a low-dimensional manifold, and a well-trained autoencoder can capture this manifold while discarding imperceptible high-frequency details. The autoencoder is trained only once and reused, so its cost is amortized. The diffusion model then focuses its capacity on modeling the semantically meaningful latent distribution rather than pixel-level details.
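The arithmetic is easy to verify directly (the attention figure is a simplified illustration that ignores channel dimensions and the fact that pixel-space models attend at reduced resolutions):

```python
pixel_dims = 512 * 512 * 3           # 786,432 values per image
latent_dims = 64 * 64 * 4            # 16,384 values per latent
print(pixel_dims / latent_dims)      # 48.0x fewer values to denoise

# Self-attention cost grows with the square of the token count, so a hypothetical
# full-resolution attention layer would be far more expensive than its latent counterpart:
print((512 * 512 / (64 * 64)) ** 2)  # 4096.0
```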
Question 9
How does ControlNet add spatial conditioning without modifying the pre-trained Stable Diffusion weights?
Show Answer
ControlNet creates a trainable copy of the Stable Diffusion U-Net encoder and connects it to the original (frozen) model through "zero convolutions": 1x1 convolutional layers initialized to zero weights and zero biases. The trainable copy processes the control signal (e.g., edge map, depth map) alongside the noisy latent. Its outputs are passed through zero convolutions and added to the corresponding skip connections in the frozen model. Because zero convolutions start at zero, the ControlNet initially has no effect on the pre-trained model, preserving its capabilities. As training progresses, the zero convolutions learn non-zero weights, gradually incorporating the control signal. This design ensures training stability and preserves the pre-trained model's generation quality.
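A minimal sketch of a zero convolution and how it gates the control branch (real ControlNet blocks may also change channel counts; this keeps them equal for simplicity):

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution with zero-initialized weights and biases."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

gate = zero_conv(8)
frozen_feat = torch.randn(1, 8, 16, 16)
control_feat = torch.randn(1, 8, 16, 16)
fused = frozen_feat + gate(control_feat)
print(torch.allclose(fused, frozen_feat))  # True: no effect at initialization
```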
Question 10
What is the SDE framework for diffusion models, and what are its advantages?
Show Answer
The SDE framework (Song et al., 2021) formulates diffusion as a continuous-time stochastic differential equation: $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$. The reverse-time SDE is $d\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g(t)^2\nabla_\mathbf{x}\log p_t(\mathbf{x})]\,dt + g(t)\,d\bar{\mathbf{w}}$, which depends on the score function. Advantages: (1) It unifies DDPM and score-based models as different discretizations of the same continuous process. (2) It enables the use of general-purpose ODE/SDE solvers for sampling, including adaptive step-size methods. (3) The probability flow ODE (obtained by setting the noise to zero in the reverse SDE) provides a deterministic mapping, enabling exact likelihood computation. (4) It provides a principled framework for designing new noise schedules and sampling algorithms.
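A sketch of both discretizations, with `f`, `g`, and `score` as assumed callables; integration runs from $t = T$ down to $0$, so `dt` is negative:

```python
import torch

def reverse_sde_step(x, t, dt, f, g, score):
    """One Euler-Maruyama step of the reverse-time SDE."""
    drift = f(x, t) - g(t) ** 2 * score(x, t)
    return x + drift * dt + g(t) * abs(dt) ** 0.5 * torch.randn_like(x)

def probability_flow_step(x, t, dt, f, g, score):
    """One Euler step of the deterministic probability flow ODE (same marginals)."""
    return x + (f(x, t) - 0.5 * g(t) ** 2 * score(x, t)) * dt
```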
Question 11
Compare DreamBooth, LoRA, and Textual Inversion for customizing diffusion models. When would you choose each?
Show Answer
- **DreamBooth** fine-tunes the full model (or large portions via LoRA) on 3-5 images of a specific subject, binding it to a unique token. It produces the highest identity fidelity but requires the most compute and storage (~2-4 GB per concept). Choose it when identity preservation is critical (e.g., generating a specific person or pet in novel scenes).
- **LoRA** adds low-rank trainable matrices to attention layers while keeping the base model frozen. It requires only 1-100 MB of additional parameters and trains quickly. Choose it when you need style adaptation or moderate concept learning with minimal storage overhead; LoRA can also be combined with DreamBooth for efficient subject-driven generation.
- **Textual Inversion** learns only a new embedding vector (a single token) while keeping the entire model frozen. It requires minimal storage (~10 KB) but has limited expressive power. Choose it when the concept can be adequately captured by a single token embedding and when computational resources are very limited.
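A sketch of the LoRA idea applied to a single linear layer (a hypothetical class, not the `peft` API): the base weights stay frozen while only the rank-$r$ factors train.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """W x + (alpha / r) * B A x, with W frozen and A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pre-trained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```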
Question 12
What are the key evaluation metrics for diffusion models, and what does each measure?
Show Answer
- **FID (Fréchet Inception Distance)**: Measures the distance between the distribution of generated images and real images in Inception feature space, using the Fréchet distance between fitted Gaussians. Lower is better. It captures both quality and diversity but is sensitive to the number of samples (typically 50K).
- **CLIP Score**: Measures the cosine similarity between CLIP embeddings of the generated image and the text prompt. Higher is better. It evaluates text-image alignment but does not assess image quality independent of the prompt.
- **IS (Inception Score)**: Computes $\exp(\mathbb{E}[D_{\text{KL}}(p(y|x) \| p(y))])$ using an Inception classifier. Higher is better. It measures both quality (high-confidence predictions) and diversity (a uniform marginal distribution) but does not compare to real data.
- **LPIPS (Learned Perceptual Image Patch Similarity)**: Measures perceptual distance between pairs of images. Useful for evaluating reconstruction quality in VAEs and super-resolution. Lower is better.
- **Human evaluation**: Pairwise preference studies where humans compare generated images. The gold standard, but expensive and not reproducible.
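A sketch of the CLIP score computation using the `transformers` library (the checkpoint is OpenAI's public ViT-B/32 model; real evaluations average this over many prompt-image pairs):

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and a prompt."""
    inputs = proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()
```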
Question 13
Why is the U-Net architecture preferred over other architectures for diffusion models?
Show Answer
The U-Net is preferred for several reasons:
1. **Skip connections** preserve fine-grained spatial information that would otherwise be lost in the bottleneck, which is critical for predicting pixel-level noise.
2. **Multi-scale processing** through the encoder-decoder structure allows the network to model both coarse global structure (in the bottleneck) and fine local details (at full resolution).
3. **Attention at lower resolutions** is computationally efficient because the spatial dimensions are small.
4. **Inductive bias for denoising**: The architecture is well-suited for tasks where the output has the same spatial structure as the input, which is exactly the case for noise prediction.
5. **Historical precedent**: U-Net was originally designed for image segmentation, another dense prediction task, making it a natural fit.

However, recent work (DiT, Sora) shows that pure transformer architectures can match or exceed U-Nets when scaled sufficiently.
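A toy U-Net skeleton showing points (1) and (2): one downsampling stage, a coarse middle block, and a skip connection back to full resolution (timestep conditioning and attention are omitted for brevity):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Conv2d(3, ch, 3, padding=1)                 # full resolution
        self.down2 = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)  # 1/2 resolution
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)          # coarse, global context
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)
        self.out = nn.Conv2d(ch * 2, 3, 3, padding=1)               # ch (upsampled) + ch (skip)

    def forward(self, x):
        h1 = torch.relu(self.down1(x))
        h2 = torch.relu(self.mid(torch.relu(self.down2(h1))))
        u = torch.relu(self.up(h2))
        return self.out(torch.cat([u, h1], dim=1))  # skip connection preserves detail
```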
Question 14
Explain the concept of "noise schedule" and why it is important for diffusion model performance.
Show Answer
The noise schedule $\{\beta_t\}_{t=1}^T$ determines how quickly noise is added during the forward process and how much information is destroyed at each timestep. It controls the distribution of noise levels that the model is trained on. A good noise schedule should: (1) reach near-pure noise at $t=T$ (so the initial distribution for sampling is approximately Gaussian), (2) destroy information gradually (so each denoising step makes only a small correction), (3) allocate sufficient training capacity to informative noise levels (where the model makes meaningful structural decisions), and (4) avoid regions where $\beta_t$ is too small (wasting steps) or too large (making denoising too difficult). The schedule significantly impacts generation quality: the cosine schedule improves over linear by providing more uniform coverage of log-SNR values, and different datasets may benefit from different schedules.
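Properties (1) and (2) can be checked numerically for the standard linear schedule (illustrative values only):

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1 - betas, dim=0)
print(alpha_bar[-1].item())  # ~4e-5: x_T is essentially pure noise, satisfying (1)
print(betas.max().item())    # 0.02: each single step destroys little information, satisfying (2)
```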
Question 15
How do consistency models achieve single-step generation, and what tradeoff do they make?
Show Answer
Consistency models (Song et al., 2023) learn a function $f_\theta(\mathbf{x}_t, t)$ that maps any point on the probability flow ODE trajectory directly to the trajectory's origin $\mathbf{x}_0$, subject to the boundary condition that $f_\theta$ is the identity at the smallest noise level. Training enforces self-consistency: predictions from adjacent points on the same trajectory must agree, either by distilling a pre-trained diffusion model (consistency distillation) or by training from scratch (consistency training). Because $f_\theta$ jumps straight to the data, a single network evaluation of pure noise yields a sample. The tradeoff is sample quality: one-step samples are typically worse than those from multi-step diffusion, though quality can be partially recovered with a few alternating denoise-and-renoise refinement steps.
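A sketch of consistency sampling, with `f(x, sigma)` standing in for a trained consistency model; the refinement loop follows the multistep procedure described by Song et al. (2023), but the noise levels here are placeholders:

```python
import torch

@torch.no_grad()
def consistency_sample(f, shape, sigma_max=80.0, refine_sigmas=()):
    """One network call yields a sample; optional re-noise/denoise steps trade compute for quality."""
    x = torch.randn(shape) * sigma_max
    x0 = f(x, sigma_max)                     # single-step generation
    for sigma in refine_sigmas:              # e.g., (20.0, 5.0) for two refinements
        x = x0 + sigma * torch.randn(shape)  # jump back to a lower noise level
        x0 = f(x, sigma)                     # map to the trajectory origin again
    return x0
```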