Chapter 27: Key Takeaways
The Diffusion Framework
- Diffusion models learn to generate data by reversing a gradual noising process. The forward process adds Gaussian noise over $T$ timesteps until data becomes pure noise; the reverse process learns to denoise step by step.
- The closed-form forward process $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ enables efficient training by sampling any timestep directly, without iterating through intermediate steps.
- The simplified training objective $\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2$ trains the network to predict the noise that was added. This is equivalent to denoising score matching and to a reweighted variational lower bound (a minimal training sketch follows this list).
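As a concrete illustration, here is a minimal PyTorch sketch of one training step under the simplified objective. The interface is an assumption, not a fixed API: `model(x_t, t)` is any noise-prediction network, and `alpha_bar` is the precomputed cumulative product of $(1 - \beta_t)$.

```python
import torch
import torch.nn.functional as F

def ddpm_training_loss(model, x0, alpha_bar, T=1000):
    """One simplified-objective training step for a batch of images x0.

    Assumes `model(x_t, t)` is a noise-prediction network and `alpha_bar`
    is a tensor of shape (T,) holding the cumulative products of (1 - beta_t).
    """
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)  # random timestep per sample
    noise = torch.randn_like(x0)                         # epsilon ~ N(0, I)
    a = alpha_bar[t].view(batch, 1, 1, 1)                # broadcast over (C, H, W)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise       # closed-form forward process
    return F.mse_loss(model(x_t, t), noise)              # ||eps - eps_theta(x_t, t)||^2
```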
Noise Schedules and Score Matching
- The noise schedule $\{\beta_t\}$ controls how quickly information is destroyed. The cosine schedule spaces log-SNR more uniformly over timesteps than the linear schedule, improving generation quality (a schedule sketch follows this list).
- Score matching provides a theoretical foundation: the noise prediction network estimates the score function $\nabla_\mathbf{x} \log p_t(\mathbf{x})$, and the SDE framework unifies discrete-time DDPM with continuous-time score-based models.
- Different parameterizations (noise prediction, $\mathbf{x}_0$ prediction, velocity prediction) are algebraically interchangeable given $\mathbf{x}_t$ and $t$, but the loss weighting each one implies affects training dynamics and sample quality differently.
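For reference, a small sketch of the cosine schedule from Improved DDPM (Nichol & Dhariwal, 2021); the offset `s = 0.008` follows that paper, and the function name is illustrative:

```python
import torch

def cosine_schedule(T=1000, s=0.008):
    """Cosine schedule: alpha_bar follows a squared cosine in t/T,
    normalized so that alpha_bar(0) = 1."""
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((t / T + s) / (1 + s)) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)  # per-step noise
    return alpha_bar[1:].float(), betas.float()
```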
Sampling Strategies
- DDPM sampling requires $T$ steps (typically 1000), making it slow. Each step applies the learned reverse transition with added stochastic noise.
- DDIM reformulates the reverse process as a non-Markovian chain, enabling subsampled timestep schedules that generate comparable quality in 50 steps, a 20x speedup. Setting $\sigma_t = 0$ yields a deterministic sampler (sketched after this list).
- Advanced samplers (DPM-Solver, consistency models, distillation) further reduce steps to 4-10 or even 1, enabling real-time generation.
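A minimal sketch of one deterministic DDIM update ($\sigma_t = 0$), under the same `model` and `alpha_bar` assumptions as the training sketch above; `t` and `t_prev` are two entries of a subsampled schedule and may be far apart:

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alpha_bar):
    """One deterministic DDIM update (sigma_t = 0) from timestep t to t_prev."""
    batch = x_t.shape[0]
    eps = model(x_t, torch.full((batch,), t, device=x_t.device))
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # implied clean sample
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```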
Latent Diffusion and Stable Diffusion
- Latent diffusion performs the diffusion process in a compressed latent space (e.g., 64x64x4 instead of 512x512x3), reducing computational cost by 16-50x while preserving perceptual quality.
- Stable Diffusion has three components: a VAE (compresses/decompresses images), a U-Net with cross-attention (denoises in latent space, conditioned on text), and a CLIP text encoder (converts prompts to embeddings).
- The U-Net's cross-attention layers are the key mechanism for text conditioning: image features (queries) attend to text embeddings (keys and values), enabling spatial alignment between words and visual regions (see the sketch after this list).
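The mechanism can be made concrete with a single-head cross-attention sketch; dimensions and names are illustrative, and production U-Nets use multi-head variants:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image features query text embeddings."""

    def __init__(self, img_dim, txt_dim, attn_dim):
        super().__init__()
        self.to_q = nn.Linear(img_dim, attn_dim, bias=False)
        self.to_k = nn.Linear(txt_dim, attn_dim, bias=False)
        self.to_v = nn.Linear(txt_dim, attn_dim, bias=False)
        self.to_out = nn.Linear(attn_dim, img_dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, H*W, img_dim) flattened spatial features
        # txt_tokens: (B, L, txt_dim) text encoder output (L = 77 for SD's CLIP)
        q, k, v = self.to_q(img_tokens), self.to_k(txt_tokens), self.to_v(txt_tokens)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, H*W, L)
        attn = scores.softmax(dim=-1)  # each spatial location attends over words
        return self.to_out(attn @ v)
```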
Guidance and Control
- Classifier-free guidance (CFG) amplifies text conditioning by extrapolating from the unconditional prediction toward (and, for $w > 1$, past) the conditional one: $\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\varnothing + w(\boldsymbol{\epsilon}_c - \boldsymbol{\epsilon}_\varnothing)$ (a sketch follows this list). Scale $w = 7.5$ is a good default; higher values improve prompt alignment but reduce diversity and may introduce artifacts.
- CFG doubles the computational cost per step (two forward passes) but is essential for high-quality conditional generation.
- ControlNet adds spatial conditioning (edges, depth, pose) through a trainable copy of the U-Net encoder connected via zero-initialized convolutions, preserving the pre-trained model's capabilities while learning new control signals.
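A sketch of the CFG combination, assuming a conditional network with the illustrative signature `model(x, t, context)`; stacking the two passes into one batched call is the usual trick, saving latency even though the compute is still doubled:

```python
import torch

@torch.no_grad()
def cfg_noise(model, x_t, t, text_emb, null_emb, w=7.5):
    """Combine conditional and unconditional noise predictions per CFG."""
    eps_uncond, eps_cond = model(
        torch.cat([x_t, x_t]),            # duplicated latents
        torch.cat([t, t]),
        torch.cat([null_emb, text_emb]),  # empty-prompt embedding first
    ).chunk(2)
    return eps_uncond + w * (eps_cond - eps_uncond)  # eps_tilde
```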
Fine-Tuning and Customization
- DreamBooth personalizes generation by fine-tuning on 3-5 images of a subject bound to a unique token. Highest fidelity but most expensive.
- LoRA adds low-rank trainable matrices to attention layers (checkpoints of roughly 1-100 MB), providing an excellent efficiency-quality tradeoff for style and concept adaptation (see the sketch after this list).
- Textual Inversion learns a single embedding vector (~10 KB) for a new concept. Most parameter-efficient but least expressive.
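A minimal LoRA wrapper around a frozen linear layer illustrates the idea; the rank `r` and scaling `alpha` shown are typical but illustrative values:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A is (r, in) and B is (out, r)."""

    def __init__(self, base: nn.Linear, r=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because `B` starts at zero, the wrapped layer initially reproduces the pretrained output exactly, the same zero-initialization trick ControlNet uses for its connecting convolutions.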
Evaluation
- FID measures distributional distance between generated and real images (lower is better). It captures both quality and diversity but requires ~50K samples.
- CLIP Score measures text-image alignment (higher is better). It is fast to compute and useful for comparing prompts and guidance scales (a sketch follows this list).
- Human evaluation remains the gold standard, as automated metrics do not fully capture perceptual quality, creativity, or prompt adherence.
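A rough CLIP Score can be computed with an off-the-shelf CLIP model, for example via Hugging Face `transformers`; the checkpoint name is illustrative, and the conventional factor-of-100 scaling is applied:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, times 100."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 100.0 * (img * txt).sum().item()
```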
Practical Guidelines
- For text-to-image, start with guidance scale 7.5, 30-50 DDIM or DPM-Solver steps, and a negative prompt to suppress common artifacts (a worked example follows this list).
- For fine-tuning, prefer LoRA unless identity preservation demands DreamBooth.
- Monitor both FID and CLIP Score during development; optimizing one can hurt the other.
- Be aware of ethical considerations: training data copyright, deepfake potential, demographic bias, and environmental cost.
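Putting these defaults together, a minimal text-to-image run with the `diffusers` library might look like the following; the model id and prompts are placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dusk",  # placeholder prompt
    negative_prompt="blurry, low quality, deformed",
    guidance_scale=7.5,        # CFG scale w
    num_inference_steps=50,    # subsampled timestep schedule
).images[0]
image.save("lighthouse.png")
```

To try DPM-Solver instead of the pipeline's default scheduler, the usual diffusers pattern is `pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)`, which typically permits 20-30 steps at similar quality.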