Chapter 27: Key Takeaways
The Diffusion Framework
- Diffusion models learn to generate data by reversing a gradual noising process. The forward process adds Gaussian noise over $T$ timesteps until data becomes pure noise; the reverse process learns to denoise step by step.
- The closed-form forward process $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ enables efficient training by sampling any timestep directly, without iterating through intermediate steps.
- The simplified training objective $\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2$ trains the network to predict the noise that was added. This is equivalent to denoising score matching and to a reweighted variational lower bound (a minimal training sketch follows this list).
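As a concrete illustration, here is a minimal PyTorch sketch of one training step under the simplified objective. The interface is an assumption, not a fixed API: `model(x_t, t)` is any noise-prediction network, and `alpha_bar` is the precomputed cumulative product of $(1 - \beta_t)$.

```python
import torch
import torch.nn.functional as F

def ddpm_training_loss(model, x0, alpha_bar, T=1000):
    """One simplified-objective training step for a batch of images x0.

    Assumes `model(x_t, t)` is a noise-prediction network and `alpha_bar`
    is a tensor of shape (T,) holding the cumulative products of (1 - beta_t).
    """
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)  # random timestep per sample
    noise = torch.randn_like(x0)                         # epsilon ~ N(0, I)
    a = alpha_bar[t].view(batch, 1, 1, 1)                # broadcast over (C, H, W)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise       # closed-form forward process
    return F.mse_loss(model(x_t, t), noise)              # ||eps - eps_theta(x_t, t)||^2
```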
Noise Schedules and Score Matching
- The noise schedule $\{\beta_t\}$ controls how quickly information is destroyed. The cosine schedule spaces log-SNR more uniformly over timesteps than the linear schedule, improving generation quality (a schedule sketch follows this list).
- Score matching provides a theoretical foundation: the noise prediction network estimates the score function $\nabla_\mathbf{x} \log p_t(\mathbf{x})$, and the SDE framework unifies discrete-time DDPM with continuous-time score-based models.
- Different parameterizations (noise prediction, $\mathbf{x}_0$ prediction, velocity prediction) are algebraically interchangeable given $\mathbf{x}_t$ and $t$, but the loss weighting each one implies affects training dynamics and sample quality differently.
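For reference, a small sketch of the cosine schedule from Improved DDPM (Nichol & Dhariwal, 2021); the offset `s = 0.008` follows that paper, and the function name is illustrative:

```python
import torch

def cosine_schedule(T=1000, s=0.008):
    """Cosine schedule: alpha_bar follows a squared cosine in t/T,
    normalized so that alpha_bar(0) = 1."""
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((t / T + s) / (1 + s)) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)  # per-step noise
    return alpha_bar[1:].float(), betas.float()
```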
Sampling Strategies
- DDPM sampling requires $T$ steps (typically 1000), making it slow. Each step applies the learned reverse transition with added stochastic noise.
- DDIM reformulates the reverse process as a non-Markovian chain, enabling subsampled timestep schedules that generate comparable quality in 50 steps, a 20x speedup. Setting $\sigma_t = 0$ yields a deterministic sampler (sketched after this list).
- Advanced samplers (DPM-Solver, consistency models, distillation) further reduce steps to 4-10 or even 1, enabling real-time generation.
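A minimal sketch of one deterministic DDIM update ($\sigma_t = 0$), under the same `model` and `alpha_bar` assumptions as the training sketch above; `t` and `t_prev` are two entries of a subsampled schedule and may be far apart:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alpha_bar):
    """One deterministic DDIM update (sigma_t = 0) from timestep t to t_prev."""
    batch = x_t.shape[0]
    eps = model(x_t, torch.full((batch,), t, device=x_t.device))
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # implied clean sample
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```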
Latent Diffusion and Stable Diffusion
- Latent diffusion performs the diffusion process in a compressed latent space (e.g., 64x64x4 instead of 512x512x3), reducing computational cost by 16-50x while preserving perceptual quality.
- Stable Diffusion has three components: a VAE (compresses/decompresses images), a U-Net with cross-attention (denoises in latent space, conditioned on text), and a CLIP text encoder (converts prompts to embeddings).
- The U-Net's cross-attention layers are the key mechanism for text conditioning: image features (queries) attend to text embeddings (keys and values), enabling spatial alignment between words and visual regions (see the sketch after this list).
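The mechanism can be made concrete with a single-head cross-attention sketch; dimensions and names are illustrative, and production U-Nets use multi-head variants:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image features query text embeddings."""

    def __init__(self, img_dim, txt_dim, attn_dim):
        super().__init__()
        self.to_q = nn.Linear(img_dim, attn_dim, bias=False)
        self.to_k = nn.Linear(txt_dim, attn_dim, bias=False)
        self.to_v = nn.Linear(txt_dim, attn_dim, bias=False)
        self.to_out = nn.Linear(attn_dim, img_dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, H*W, img_dim) flattened spatial features
        # txt_tokens: (B, L, txt_dim) text encoder output (L = 77 for SD's CLIP)
        q, k, v = self.to_q(img_tokens), self.to_k(txt_tokens), self.to_v(txt_tokens)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, H*W, L)
        attn = scores.softmax(dim=-1)  # each spatial location attends over words
        return self.to_out(attn @ v)
```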
Guidance and Control
- Classifier-free guidance (CFG) amplifies text conditioning by extrapolating from the unconditional prediction toward (and, for $w > 1$, past) the conditional one: $\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\varnothing + w(\boldsymbol{\epsilon}_c - \boldsymbol{\epsilon}_\varnothing)$ (a sketch follows this list). Scale $w = 7.5$ is a good default; higher values improve prompt alignment but reduce diversity and may introduce artifacts.
- CFG doubles the computational cost per step (two forward passes) but is essential for high-quality conditional generation.
- ControlNet adds spatial conditioning (edges, depth, pose) through a trainable copy of the U-Net encoder connected via zero-initialized convolutions, preserving the pre-trained model's capabilities while learning new control signals.
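A sketch of the CFG combination, assuming a conditional network with the illustrative signature `model(x, t, context)`; stacking the two passes into one batched call is the usual trick, saving latency even though the compute is still doubled:

```python
import torch

@torch.no_grad()
def cfg_noise(model, x_t, t, text_emb, null_emb, w=7.5):
    """Combine conditional and unconditional noise predictions per CFG."""
    eps_uncond, eps_cond = model(
        torch.cat([x_t, x_t]),            # duplicated latents
        torch.cat([t, t]),
        torch.cat([null_emb, text_emb]),  # empty-prompt embedding first
    ).chunk(2)
    return eps_uncond + w * (eps_cond - eps_uncond)  # eps_tilde
```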
Fine-Tuning and Customization
- DreamBooth personalizes generation by fine-tuning on 3-5 images of a subject bound to a unique token. Highest fidelity but most expensive.
- LoRA adds low-rank trainable matrices to attention layers (checkpoints of roughly 1-100 MB), providing an excellent efficiency-quality tradeoff for style and concept adaptation (see the sketch after this list).
- Textual Inversion learns a single embedding vector (~10 KB) for a new concept. Most parameter-efficient but least expressive.
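A minimal LoRA wrapper around a frozen linear layer illustrates the idea; the rank `r` and scaling `alpha` shown are typical but illustrative values:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A is (r, in) and B is (out, r)."""

    def __init__(self, base: nn.Linear, r=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because `B` starts at zero, the wrapped layer initially reproduces the pretrained output exactly, the same zero-initialization trick ControlNet uses for its connecting convolutions.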
Evaluation
- FID measures distributional distance between generated and real images (lower is better). It captures both quality and diversity but requires ~50K samples.
- CLIP Score measures text-image alignment (higher is better). It is fast to compute and useful for comparing prompts and guidance scales (a sketch follows this list).
- Human evaluation remains the gold standard, as automated metrics do not fully capture perceptual quality, creativity, or prompt adherence.
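A rough CLIP Score can be computed with an off-the-shelf CLIP model, for example via Hugging Face `transformers`; the checkpoint name is illustrative, and the conventional factor-of-100 scaling is applied:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, times 100."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 100.0 * (img * txt).sum().item()
```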
Practical Guidelines
- For text-to-image, start with guidance scale 7.5, 30-50 DDIM or DPM-Solver steps, and a negative prompt to suppress common artifacts (a worked example follows this list).
- For fine-tuning, prefer LoRA unless identity preservation demands DreamBooth.
- Monitor both FID and CLIP Score during development; optimizing one can hurt the other.
- Be aware of ethical considerations: training data copyright, deepfake potential, demographic bias, and environmental cost.
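Putting these defaults together, a minimal text-to-image run with the `diffusers` library might look like the following; the model id and prompts are placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dusk",  # placeholder prompt
    negative_prompt="blurry, low quality, deformed",
    guidance_scale=7.5,        # CFG scale w
    num_inference_steps=50,    # subsampled timestep schedule
).images[0]
image.save("lighthouse.png")
```

To try DPM-Solver instead of the pipeline's default scheduler, the usual diffusers pattern is `pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)`, which typically permits 20-30 steps at similar quality.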