Chapter 27: Diffusion Models and Image Generation

"All you need is a noise schedule and a good denoiser." — Adapted from the deep learning community

In 2020, a paper by Ho, Jain, and Abbeel introduced Denoising Diffusion Probabilistic Models (DDPM), demonstrating that a conceptually simple process — gradually adding noise to data and then learning to reverse that process — could generate images rivaling the quality of Generative Adversarial Networks. By 2022, diffusion models had not only surpassed GANs in image quality but had become the foundation of groundbreaking systems like DALL-E 2, Stable Diffusion, and Midjourney, fundamentally transforming how humans create visual content.

This chapter provides a thorough treatment of diffusion models, from their mathematical foundations to their practical implementation in modern image generation systems. You will learn how the forward diffusion process systematically destroys information, how neural networks learn to reverse that destruction, and how architectural innovations like latent diffusion and classifier-free guidance have made these models both practical and controllable. By the end, you will understand every major component of the Stable Diffusion pipeline and be able to implement, modify, and extend diffusion-based generation systems using PyTorch and Hugging Face.

The mathematical elegance of diffusion models is worth appreciating. Unlike GANs, which require a delicate adversarial training dance, or VAEs, which optimize a surrogate bound, diffusion models reduce to a simple denoising objective: given a corrupted image, predict the noise that was added. This conceptual simplicity belies the richness of the framework — as we will see, the same foundation supports text-conditioned generation, image editing, inpainting, super-resolution, and video synthesis. The techniques developed here form the generative backbone for the multimodal systems discussed in Chapter 28 and the video generation models covered in Chapter 30.


27.1 Generative Model Taxonomy

Before diving into diffusion models, it is essential to situate them within the broader landscape of generative models. Each family of generative models offers a distinct approach to learning and sampling from complex data distributions.

27.1.1 The Generative Modeling Problem

The central goal of generative modeling is to learn a model distribution $p_\theta(\mathbf{x})$ that approximates the true data distribution $p_{\text{data}}(\mathbf{x})$. Once learned, we can sample from $p_\theta$ to generate new data points that resemble the training data.

For images, this is extraordinarily challenging. A 256x256 RGB image lives in $\mathbb{R}^{196608}$, and the data distribution occupies a tiny, highly structured manifold within this enormous space. Different generative model families tackle this challenge through fundamentally different strategies.

27.1.2 Generative Adversarial Networks (GANs)

GANs (Goodfellow et al., 2014) use a two-player game: a generator $G$ maps random noise to synthetic data, while a discriminator $D$ attempts to distinguish real from generated samples. The generator learns implicitly — it never models $p_\theta(\mathbf{x})$ directly but instead learns a mapping from a simple distribution (typically Gaussian) to the data distribution.

Strengths: Fast single-step sampling; high-quality outputs with sharp details.

Weaknesses: Training instability (mode collapse, oscillation); no likelihood evaluation; limited diversity; difficult to control.

Key architectures include DCGAN, StyleGAN, StyleGAN2, and StyleGAN3, with StyleGAN2 representing the peak of GAN-based image synthesis quality before diffusion models. The GAN era lasted roughly from 2014 to 2021, producing progressively more impressive results but never fully solving the training instability problem. The emergence of diffusion models in 2020-2021 provided a stable alternative that achieved comparable or superior quality, leading to a rapid decline in new GAN architectures for image generation — though GANs remain relevant in discriminator-based training objectives and for certain real-time applications where single-step generation is essential.

27.1.3 Variational Autoencoders (VAEs)

VAEs (Kingma & Welling, 2014) learn both an encoder $q_\phi(\mathbf{z}|\mathbf{x})$ that maps data to a latent space and a decoder $p_\theta(\mathbf{x}|\mathbf{z})$ that reconstructs data from latent codes. Training maximizes the Evidence Lower Bound (ELBO):

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$$

Strengths: Stable training; meaningful latent space; tractable likelihood bound.

Weaknesses: Blurry outputs due to the reconstruction loss; limited sample quality compared to GANs and diffusion models.

27.1.4 Normalizing Flows

Normalizing flows (Rezende & Mohamed, 2015) learn an invertible transformation $f_\theta: \mathbf{z} \to \mathbf{x}$ from a simple base distribution to the data distribution. Because $f_\theta$ is invertible, exact log-likelihoods can be computed using the change-of-variables formula:

$$\log p_\theta(\mathbf{x}) = \log p(\mathbf{z}) + \log \left|\det \frac{\partial f_\theta^{-1}}{\partial \mathbf{x}}\right|$$

Strengths: Exact likelihood computation; invertible mapping.

Weaknesses: Architectural constraints (must be invertible); limited expressiveness; high computational cost for high-dimensional data.

27.1.5 Autoregressive Models

Autoregressive models (PixelCNN, PixelRNN, and later image GPT) factor the joint distribution as a product of conditionals:

$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i | x_1, \ldots, x_{i-1})$$

Strengths: Exact likelihood; strong sample quality.

Weaknesses: Sequential generation (extremely slow for images); fixed ordering assumption.

27.1.6 Diffusion Models: The New Paradigm

Diffusion models define a forward process that gradually adds noise to data and a reverse process that learns to denoise. They combine the stable training of VAEs with the high sample quality of GANs, while also providing likelihood computation. The key insight is that learning to denoise is far easier than learning to generate from scratch — each denoising step only needs to make a small correction, and the neural network is trained on this simple objective at every noise level.


27.2 Denoising Diffusion Probabilistic Models (DDPM)

The DDPM framework, formalized by Ho et al. (2020) and building on earlier work by Sohl-Dickstein et al. (2015), is the foundation upon which modern diffusion models are built.

27.2.1 The Forward Process (Adding Noise)

The forward diffusion process is a fixed Markov chain that gradually adds Gaussian noise to a data sample $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ over $T$ timesteps, producing a sequence of increasingly noisy versions $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$:

$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I})$$

where $\beta_t \in (0, 1)$ is the variance schedule at timestep $t$. At each step, the data is scaled by $\sqrt{1 - \beta_t}$ and Gaussian noise with variance $\beta_t$ is added.

A critical property enables efficient computation of $\mathbf{x}_t$ directly from $\mathbf{x}_0$ without iterating through all intermediate steps. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then:

$$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1 - \bar{\alpha}_t)\mathbf{I})$$

This means we can sample $\mathbf{x}_t$ at any timestep in closed form:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Worked Example: Consider $T = 1000$ with a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. At $t = 250$, $\bar{\alpha}_{250} \approx 0.52$, so $\mathbf{x}_{250} \approx 0.72\,\mathbf{x}_0 + 0.69\,\boldsymbol{\epsilon}$ — the signal is still visible but noticeably noisy. At $t = 750$, $\bar{\alpha}_{750} \approx 0.003$, so $\mathbf{x}_{750} \approx 0.06\,\mathbf{x}_0 + 0.998\,\boldsymbol{\epsilon}$ — the signal is almost entirely obscured by noise. At $t = T$, $\bar{\alpha}_T \approx 0$, and $\mathbf{x}_T$ is essentially pure Gaussian noise.
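
These coefficients are easy to verify numerically. A quick PyTorch check of the linear schedule (a standalone snippet, not part of any library):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative products alpha_bar_t

for t in (250, 750):
    ab = alpha_bars[t - 1]                      # alpha_bar_t (timesteps are 1-indexed)
    print(f"t={t}: alpha_bar={ab.item():.3f}, "
          f"signal coeff={ab.sqrt().item():.3f}, noise coeff={(1 - ab).sqrt().item():.3f}")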

27.2.2 The Reverse Process (Removing Noise)

The reverse process aims to undo the forward diffusion, generating data from noise. If we knew the exact reverse transition $q(\mathbf{x}_{t-1} | \mathbf{x}_t)$, we could start from pure noise and iteratively denoise to obtain a clean sample. However, computing $q(\mathbf{x}_{t-1} | \mathbf{x}_t)$ requires marginalizing over the entire data distribution.

Instead, we learn a parametric approximation $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ using a neural network. The reverse process is also a Markov chain:

$$p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1};\; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \sigma_t^2 \mathbf{I})$$

where $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ is predicted by the neural network and $\sigma_t^2$ is typically set to $\beta_t$ or $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t$ (the posterior variance).

A key mathematical result is that the true posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ — the reverse step conditioned on the original clean data — is tractable and Gaussian:

$$q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1};\; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\; \tilde{\beta}_t \mathbf{I})$$

where:

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\mathbf{x}_t$$

27.2.3 Deriving the Posterior Mean

Let us work through the derivation of the posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ in detail, as it reveals why the noise prediction objective works. Using Bayes' theorem on the Markov chain:

$$q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) \cdot q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)}$$

Since the forward process is Markov, $q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_t | \mathbf{x}_{t-1})$. Every term on the right-hand side is Gaussian, so the resulting distribution over $\mathbf{x}_{t-1}$ is also Gaussian. Substituting the known forms and completing the square in $\mathbf{x}_{t-1}$, we obtain the posterior mean:

$$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\mathbf{x}_t$$

and the posterior variance:

$$\tilde{\beta}_t = \frac{(1 - \bar{\alpha}_{t-1})}{(1 - \bar{\alpha}_t)} \cdot \beta_t$$

The intuition behind the posterior mean is illuminating. It is a weighted average of two quantities: a scaled version of the original clean image $\mathbf{x}_0$ and a scaled version of the current noisy image $\mathbf{x}_t$. Early in the reverse process (large $t$, where $\bar{\alpha}_t \approx 0$), the weight on $\mathbf{x}_0$ is small and the model must heavily rely on its noise prediction. Late in the reverse process (small $t$, where $\bar{\alpha}_t \approx 1$), the noisy image already closely resembles the clean image, and only minor corrections are needed.

Worked Example: Suppose that at $t = 500$ the schedule gives $\bar{\alpha}_{500} \approx 0.30$ and $\bar{\alpha}_{499} \approx 0.305$, so that $\alpha_{500} \approx 0.984$ and $\beta_{500} \approx 0.016$. The coefficient on $\mathbf{x}_0$ is $\frac{\sqrt{0.305} \times 0.016}{0.70} \approx 0.013$, while the coefficient on $\mathbf{x}_t$ is $\frac{\sqrt{0.984} \times 0.695}{0.70} \approx 0.985$. We see that the model relies almost entirely on the current noisy image, making only a small correction toward $\mathbf{x}_0$ at each step. This incremental denoising is what makes the process stable.

27.2.4 The Noise Prediction Objective

Rather than predicting $\boldsymbol{\mu}_\theta$ directly, Ho et al. showed that it is more effective to train the network to predict the noise $\boldsymbol{\epsilon}$ that was added. Using the reparameterization $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon})$, the mean can be rewritten as:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right)$$

The simplified training objective becomes:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2\right]$$

This is remarkably simple: sample a random timestep $t$, add noise to a training image to get $\mathbf{x}_t$, and train the network to predict what noise was added. The full training algorithm is:

  1. Sample $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ (a training image)
  2. Sample $t \sim \text{Uniform}(\{1, \ldots, T\})$
  3. Sample $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
  4. Compute $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$
  5. Take gradient step on $\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2$

27.2.5 The Sampling Algorithm

To generate an image, we start from pure noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively denoise:

For $t = T, T-1, \ldots, 1$:

  1. Sample $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, else $\mathbf{z} = \mathbf{0}$
  2. Compute $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}$

This requires $T$ forward passes through the neural network (typically $T = 1000$), making sampling slow — a fundamental limitation that later work addresses.

27.2.6 Alternative Prediction Targets

While noise prediction ($\boldsymbol{\epsilon}$-prediction) is the most common training objective, two alternative parameterizations have proven valuable:

$\mathbf{x}_0$-prediction: The network directly predicts the clean image $\hat{\mathbf{x}}_0 = f_\theta(\mathbf{x}_t, t)$. This is equivalent to $\boldsymbol{\epsilon}$-prediction (one can be converted to the other), but can be more stable at low noise levels where the noise is small relative to the signal. Some implementations use $\mathbf{x}_0$-prediction for display purposes during training, as intermediate denoised images are more interpretable than noise predictions.

$\mathbf{v}$-prediction (Salimans & Ho, 2022): The network predicts the "velocity" $\mathbf{v} = \sqrt{\bar{\alpha}_t}\,\boldsymbol{\epsilon} - \sqrt{1 - \bar{\alpha}_t}\,\mathbf{x}_0$. This parameterization provides a more uniform signal-to-noise ratio across timesteps and has been shown to improve sample quality in some settings, particularly for high-resolution generation. Stable Diffusion 2.1's 768x768 model uses $\mathbf{v}$-prediction.

The relationship between these targets is straightforward. Given any one, the other two can be recovered:

$$\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\hat{\boldsymbol{\epsilon}}}{\sqrt{\bar{\alpha}_t}}, \quad \hat{\boldsymbol{\epsilon}} = \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\hat{\mathbf{x}}_0}{\sqrt{1 - \bar{\alpha}_t}}$$
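
These conversions can be packaged into a small helper. The following sketch assumes a scalar $\bar{\alpha}_t$ and the definition of $\mathbf{v}$ given above:

import torch

def convert_predictions(x_t, pred, alpha_bar_t, mode="eps"):
    """Recover (x0_hat, eps_hat) from a model output under a given parameterization.

    mode is one of "eps", "x0", or "v"; alpha_bar_t is the scalar cumulative
    product for the current timestep. Illustrative sketch, not a library API.
    """
    sqrt_ab = alpha_bar_t ** 0.5
    sqrt_omab = (1.0 - alpha_bar_t) ** 0.5
    if mode == "eps":
        eps_hat = pred
        x0_hat = (x_t - sqrt_omab * eps_hat) / sqrt_ab
    elif mode == "x0":
        x0_hat = pred
        eps_hat = (x_t - sqrt_ab * x0_hat) / sqrt_omab
    else:  # "v" = sqrt(ab) * eps - sqrt(1 - ab) * x0
        x0_hat = sqrt_ab * x_t - sqrt_omab * pred
        eps_hat = sqrt_omab * x_t + sqrt_ab * pred
    return x0_hat, eps_hat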

27.2.7 Complete DDPM Implementation

Here is a minimal but complete DDPM training loop in PyTorch:

import torch
import torch.nn as nn


class SimpleDDPMTrainer:
    """Minimal DDPM training loop for illustration.

    Args:
        model: The noise prediction network (U-Net).
        T: Number of diffusion timesteps.
        beta_start: Starting noise schedule value.
        beta_end: Ending noise schedule value.
        device: Device to train on.
    """

    def __init__(
        self,
        model: nn.Module,
        T: int = 1000,
        beta_start: float = 1e-4,
        beta_end: float = 0.02,
        device: str = "cuda",
    ) -> None:
        self.model = model.to(device)
        self.device = device
        self.T = T

        # Linear noise schedule
        betas = torch.linspace(beta_start, beta_end, T, device=device)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        self.betas = betas
        self.sqrt_alpha_bars = torch.sqrt(alpha_bars)
        self.sqrt_one_minus_alpha_bars = torch.sqrt(1.0 - alpha_bars)
        self.sqrt_recip_alphas = 1.0 / torch.sqrt(alphas)
        self.posterior_coeff = betas / self.sqrt_one_minus_alpha_bars

    def train_step(self, x_0: torch.Tensor) -> torch.Tensor:
        """Execute one training step.

        Args:
            x_0: Clean images of shape [batch, channels, height, width].

        Returns:
            Scalar loss value.
        """
        batch_size = x_0.shape[0]

        # Sample random timesteps
        t = torch.randint(0, self.T, (batch_size,), device=self.device)

        # Sample noise
        epsilon = torch.randn_like(x_0)

        # Create noisy images
        sqrt_ab = self.sqrt_alpha_bars[t][:, None, None, None]
        sqrt_omab = self.sqrt_one_minus_alpha_bars[t][:, None, None, None]
        x_t = sqrt_ab * x_0 + sqrt_omab * epsilon

        # Predict noise
        epsilon_pred = self.model(x_t, t)

        # MSE loss
        loss = nn.functional.mse_loss(epsilon_pred, epsilon)
        return loss

    @torch.no_grad()
    def sample(self, shape: tuple[int, ...]) -> torch.Tensor:
        """Generate samples using the DDPM sampling algorithm.

        Args:
            shape: Shape of the output (batch, channels, height, width).

        Returns:
            Generated images tensor.
        """
        x = torch.randn(shape, device=self.device)

        for t in reversed(range(self.T)):
            t_batch = torch.full(
                (shape[0],), t, device=self.device, dtype=torch.long
            )
            epsilon_pred = self.model(x, t_batch)

            coeff = self.posterior_coeff[t]
            mean = self.sqrt_recip_alphas[t] * (x - coeff * epsilon_pred)

            if t > 0:
                noise = torch.randn_like(x)
                sigma = torch.sqrt(self.betas[t])
                x = mean + sigma * noise
            else:
                x = mean

        return x

27.3 Noise Schedules

The noise schedule $\{\beta_t\}_{t=1}^T$ controls how quickly information is destroyed during the forward process and significantly impacts generation quality.

27.3.1 Linear Schedule

The original DDPM uses a linear schedule:

$$\beta_t = \beta_{\min} + \frac{t-1}{T-1}(\beta_{\max} - \beta_{\min})$$

with $\beta_{\min} = 10^{-4}$ and $\beta_{\max} = 0.02$. This produces $\bar{\alpha}_T \approx 6 \times 10^{-6}$, which is close enough to zero to ensure the forward process reaches near-pure noise.

However, the linear schedule has a problem: the signal-to-noise ratio drops too rapidly in the early timesteps and too slowly in the later ones, leading to suboptimal use of the network's capacity.

27.3.2 Cosine Schedule

Nichol and Dhariwal (2021) proposed a cosine schedule that provides a more gradual transition:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2$$

where $s = 0.008$ is a small offset to prevent $\beta_t$ from being too small near $t = 0$. The cosine schedule maintains more signal in the mid-range timesteps, which empirically improves sample quality, especially for lower-resolution images.
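
A sketch of this schedule in PyTorch, with the per-step $\beta_t$ recovered as $1 - \bar{\alpha}_t / \bar{\alpha}_{t-1}$ and clipped at 0.999 as in the original paper:

import torch

def cosine_alpha_bars(T: int, s: float = 0.008) -> torch.Tensor:
    """Cosine schedule of Nichol & Dhariwal: returns alpha_bar_1 .. alpha_bar_T."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bars = f / f[0]
    return alpha_bars[1:]

def betas_from_alpha_bars(alpha_bars: torch.Tensor) -> torch.Tensor:
    """Recover per-step betas via beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}."""
    prev = torch.cat([torch.ones(1, dtype=alpha_bars.dtype), alpha_bars[:-1]])
    betas = 1.0 - alpha_bars / prev
    return betas.clamp(max=0.999)  # clip large betas near t = T, as in the paper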

27.3.3 Signal-to-Noise Ratio Perspective

A unified view of noise schedules uses the signal-to-noise ratio (SNR):

$$\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$$

The log-SNR $\lambda_t = \log(\text{SNR}(t))$ provides a natural scale for comparing schedules. A well-designed schedule should space log-SNR values uniformly across a wide range, ensuring that the network learns to denoise at all noise levels effectively.

27.3.4 Learned Schedules

Recent work has explored learning the noise schedule as part of training. Kingma et al. (2021) showed that monotonic neural networks can parameterize the noise schedule, allowing it to adapt to the specific dataset and model architecture. In practice, cosine schedules work well enough that learned schedules provide only marginal improvements.

27.3.5 Schedule Comparison: Practical Impact

To build intuition for how schedules affect generation, consider the following comparison at $t = T/2$ (halfway through the forward process):

| Schedule | $\bar{\alpha}_{T/2}$ | Signal fraction $\sqrt{\bar{\alpha}}$ | Noise fraction $\sqrt{1-\bar{\alpha}}$ | Visual appearance |
|----------|----------------------|---------------------------------------|-----------------------------------------|-------------------|
| Linear   | ~0.08                | 28%                                   | 96%                                     | Mostly noise; structure barely recognizable |
| Cosine   | ~0.50                | 71%                                   | 71%                                     | Clearly visible with moderate noise |
The cosine schedule preserves more signal at the midpoint, giving the model more "useful" timesteps where it can learn to denoise meaningful structure rather than pure noise. This is particularly important for small images (e.g., 64x64) where the signal is destroyed more quickly.

Choosing a schedule in practice: For latent diffusion models operating at 64x64 latent resolution (the standard for Stable Diffusion), the linear schedule works well. For pixel-space models at 32x32 or 64x64, the cosine schedule is preferred. For high-resolution models (256x256+), a "shifted" cosine schedule with modified offset may be optimal. The general principle is that larger spatial dimensions require more aggressive noise (higher $\beta$ values) to fully corrupt the signal, while smaller dimensions need gentler noise to avoid premature signal destruction.


27.4 Score Matching and the Score-Based Perspective

An alternative mathematical framework for diffusion models connects them to score-based generative modeling, providing deeper theoretical insight and practical improvements.

27.4.1 The Score Function

The score function of a distribution $p(\mathbf{x})$ is the gradient of the log-density:

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

The score points in the direction of increasing probability density. If we knew the score function everywhere, we could generate samples by following it using Langevin dynamics:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \frac{\delta}{2}\nabla_{\mathbf{x}} \log p(\mathbf{x}_k) + \sqrt{\delta}\,\mathbf{z}_k, \quad \mathbf{z}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

As $\delta \to 0$ and the number of steps $\to \infty$, this Markov chain converges to samples from $p(\mathbf{x})$.
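
As a toy illustration, the following sketch runs unadjusted Langevin dynamics on a distribution whose score is known in closed form (a standard Gaussian, whose score is simply $-\mathbf{x}$):

import torch

def langevin_sample(score_fn, x0, n_steps=1000, delta=1e-2):
    """Unadjusted Langevin dynamics: follow the score plus injected Gaussian noise."""
    x = x0.clone()
    for _ in range(n_steps):
        x = x + 0.5 * delta * score_fn(x) + (delta ** 0.5) * torch.randn_like(x)
    return x

# Toy example: the score of a standard 2-D Gaussian is -x, so samples started far
# from the origin drift back and settle into an approximately standard normal cloud.
samples = langevin_sample(lambda x: -x, torch.zeros(5000, 2) + 3.0)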

27.4.2 Score Matching

Directly estimating the score function is difficult because it requires knowing the normalizing constant of $p(\mathbf{x})$. Score matching (Hyvarinen, 2005) provides an alternative: we can train a neural network $\mathbf{s}_\theta(\mathbf{x})$ to approximate the score without knowing the normalization constant, by minimizing:

$$\mathcal{L}_{\text{SM}} = \mathbb{E}_{p(\mathbf{x})}\left[\frac{1}{2}\|\mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p(\mathbf{x})\|^2\right]$$

This can be shown to be equivalent (up to a constant) to:

$$\mathcal{L}_{\text{SM}} = \mathbb{E}_{p(\mathbf{x})}\left[\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})) + \frac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|^2\right]$$

27.4.3 Denoising Score Matching

Computing the trace of the Jacobian in the score matching objective is expensive. Vincent (2011) showed that we can instead train a score network on noisy data. Given a noise kernel $q_\sigma(\tilde{\mathbf{x}} | \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 \mathbf{I})$, denoising score matching trains:

$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{q(\mathbf{x})q_\sigma(\tilde{\mathbf{x}}|\mathbf{x})}\left[\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} | \mathbf{x})\|^2\right]$$

Since $\nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} | \mathbf{x}) = -\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2} = -\frac{\boldsymbol{\epsilon}}{\sigma}$, this is equivalent to predicting the noise — exactly the DDPM objective.

27.4.4 The SDE Framework

Song et al. (2021) unified DDPM and score-based models by formulating diffusion as a continuous-time stochastic differential equation (SDE):

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$$

where $\mathbf{f}$ is the drift coefficient, $g$ is the diffusion coefficient, and $\mathbf{w}$ is a standard Wiener process. The corresponding reverse-time SDE is:

$$d\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})]\,dt + g(t)\,d\bar{\mathbf{w}}$$

This formulation reveals that learning the score function $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ at all noise levels is sufficient to reverse the diffusion process. It also enables the use of general-purpose SDE solvers for sampling, opening the door to faster and more flexible generation strategies.

For the DDPM forward process, the SDE takes the Variance-Preserving (VP) form:

$$d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{w}$$

An alternative is the Variance-Exploding (VE) form, corresponding to SMLD (Score Matching with Langevin Dynamics):

$$d\mathbf{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\,d\mathbf{w}$$

The SDE framework also reveals a corresponding probability flow ODE — a deterministic ordinary differential equation that generates the same marginal distributions as the SDE:

$$d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]dt$$

This ODE is particularly important because it enables:

  1. Exact likelihood computation: The change of variables formula gives exact log-likelihoods, unlike the ELBO used in VAEs.
  2. Adaptive step-size solvers: Standard ODE solvers (Euler, Runge-Kutta, DPM-Solver) can be used for efficient sampling with adaptive step sizes.
  3. Deterministic sampling: The same initial noise always produces the same output, enabling reproducibility and interpolation.

The probability flow ODE is the theoretical foundation for DDIM, DPM-Solver, and other deterministic samplers discussed in the next section.


27.5 DDIM: Accelerated Sampling

The primary weakness of DDPM is sampling speed: generating a single image requires 1000 sequential denoising steps. Denoising Diffusion Implicit Models (DDIM), introduced by Song et al. (2020), provide a solution by reformulating the reverse process.

27.5.1 From Stochastic to Deterministic Sampling

DDIM observes that the DDPM training objective does not depend on the specific reverse process — only on the marginals $q(\mathbf{x}_t | \mathbf{x}_0)$. This means we can define a family of reverse processes that all share the same marginals but differ in their stochasticity.

The DDIM update rule is:

$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\underbrace{\left(\frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}\right)}_{\text{predicted } \mathbf{x}_0} + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + \sigma_t\,\boldsymbol{\epsilon}_t$$

where $\sigma_t$ controls the stochasticity:

  • $\sigma_t = \sqrt{\frac{(1-\bar{\alpha}_{t-1})}{(1-\bar{\alpha}_t)}\beta_t}$ recovers DDPM
  • $\sigma_t = 0$ gives the deterministic DDIM sampler

27.5.2 Subsampling Timesteps

Because DDIM does not assume a Markov chain, we can subsample the timestep sequence. Instead of using all $T$ steps, we select a subsequence $\tau_1 < \tau_2 < \cdots < \tau_S$ where $S \ll T$. With $S = 50$ steps, DDIM produces samples nearly indistinguishable from 1000-step DDPM, achieving a 20x speedup.
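
A deterministic DDIM sampler over such a subsampled grid can be sketched as follows (the noise-prediction model signature and the uniform timestep spacing are illustrative assumptions):

import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bars, num_steps=50, device="cuda"):
    """Deterministic DDIM sampling (sigma_t = 0) over a subsampled timestep grid.

    model(x, t) is assumed to predict noise; alpha_bars is the length-T tensor of
    cumulative products. Illustrative sketch, not the diffusers scheduler API.
    """
    alpha_bars = alpha_bars.to(device)
    T = alpha_bars.shape[0]
    taus = torch.linspace(T - 1, 0, num_steps).long()  # e.g. 50 of the 1000 steps
    x = torch.randn(shape, device=device)

    for i, t in enumerate(taus):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[taus[i + 1]] if i + 1 < num_steps else torch.tensor(1.0, device=device)
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = model(x, t_batch)

        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()      # predicted x_0
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps  # sigma_t = 0
    return x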

27.5.3 Deterministic Encoding

When $\sigma_t = 0$, the DDIM process is deterministic, meaning the mapping from $\mathbf{x}_T$ to $\mathbf{x}_0$ is a fixed function. This enables:

  • Meaningful latent space: The noise $\mathbf{x}_T$ becomes a meaningful encoding of the image.
  • Interpolation: Interpolating between two noise vectors produces semantically meaningful blends.
  • Inversion: Given an image, we can find its corresponding noise vector by running the forward DDIM process, enabling editing of real images.

27.5.4 Practical Impact of DDIM

The deterministic property of DDIM with $\sigma_t = 0$ has profound practical implications. Because the mapping from noise to image is a fixed function, two users who start with the same noise vector and text prompt will produce identical images. This enables:

  • Reproducibility: Setting a random seed produces the exact same output every time, critical for production pipelines.
  • Seed exploration: Users can generate many images by varying only the seed, finding compositions they like, then refining with prompt changes while keeping the same seed.
  • Consistent style: For batch generation (e.g., generating a set of product images), using the same seed with different prompts produces images with consistent style and composition.

27.5.5 Later Advances in Fast Sampling

DDIM opened the door to numerous fast sampling methods:

  • DPM-Solver (Lu et al., 2022): Uses exponential integrator methods to achieve high quality in 10-20 steps. Treats the diffusion ODE as a semi-linear ODE and applies high-order numerical methods. DPM-Solver++ with 20 steps often outperforms DDIM with 50 steps.
  • Progressive Distillation (Salimans & Ho, 2022): Trains a student model to match two teacher steps in one step, repeatedly halving the number of steps. After 4 rounds of distillation, the model generates in $1000 / 2^4 \approx 62$ steps; further rounds reach 4-8 steps.
  • Consistency Models (Song et al., 2023): Learn to map any point on the diffusion trajectory directly to $\mathbf{x}_0$, enabling single-step generation. The key constraint is self-consistency: for any two points on the same trajectory, the model should map both to the same output.

The following table summarizes the quality-speed tradeoff for a typical Stable Diffusion model:

| Method | Steps | Time (A100) | FID (COCO, vs. DDPM) | Notes |
|--------|-------|-------------|----------------------|-------|
| DDPM | 1000 | ~60s | Baseline | Original, too slow |
| DDIM | 50 | ~3s | +0.5 | Standard fast sampler |
| DPM-Solver++ | 20 | ~1.2s | +0.3 | Best quality per step |
| LCM | 4 | ~0.3s | +2.0 | Latent consistency |
| SDXL-Turbo | 1 | ~0.1s | +5.0 | Adversarial distillation |

27.6 The U-Net Architecture for Diffusion

The neural network at the heart of a diffusion model must take a noisy image $\mathbf{x}_t$ and a timestep $t$ as input and predict the noise $\boldsymbol{\epsilon}$. The U-Net architecture, originally designed for medical image segmentation, has become the standard choice.

27.6.1 U-Net Basics

The U-Net follows an encoder-decoder structure with skip connections:

  1. Encoder (downsampling path): A series of convolutional blocks that progressively reduce spatial resolution while increasing channel depth. Each block typically consists of two convolutional layers with group normalization and SiLU (Swish) activation, followed by downsampling (strided convolution or average pooling).

  2. Bottleneck: The lowest-resolution representation, where the most abstract features are processed.

  3. Decoder (upsampling path): A symmetric series of blocks that progressively increase spatial resolution. Each block receives skip connections from the corresponding encoder block, concatenating features to preserve fine-grained spatial information.

27.6.2 Timestep Conditioning

The U-Net must know which noise level it is denoising. Timestep information is injected through sinusoidal position embeddings (similar to those in the Transformer), which are then projected through a two-layer MLP and added to intermediate feature maps via adaptive group normalization or simple addition.

The sinusoidal embedding for timestep $t$ is:

$$\text{emb}(t)_{2i} = \sin(t / 10000^{2i/d}), \quad \text{emb}(t)_{2i+1} = \cos(t / 10000^{2i/d})$$

This provides the network with a smooth, continuous representation of the noise level.
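
A common implementation of this embedding looks roughly as follows (a sketch mirroring the formula above; the embedding dimension is assumed to be even):

import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal timestep embedding for integer timesteps t of shape [batch].

    Returns embeddings of shape [batch, dim]; frequencies follow 1 / 10000^(2i/d).
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)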

27.6.3 Attention Layers

Modern diffusion U-Nets incorporate self-attention layers at one or more resolution levels (typically at 16x16 and 8x8 resolution). These allow the network to model long-range dependencies within the image, which is crucial for generating coherent global structure.

Each attention block computes multi-head self-attention over the spatial dimensions:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

where $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ are linear projections of the feature map reshaped to a sequence.

27.6.4 Architecture Details

A typical diffusion U-Net for 256x256 images has:

  • Channel progression: 128 -> 256 -> 512 -> 512 (doubling at each downsample)
  • Resolution progression: 256 -> 128 -> 64 -> 32 -> 16 -> 8
  • Residual blocks per resolution: 2-3
  • Attention at resolutions: 32, 16, 8
  • Total parameters: ~350M for Improved DDPM

Why attention is used only at lower resolutions: Self-attention has $O(N^2)$ complexity where $N$ is the number of spatial positions. At 256x256 resolution, $N = 65{,}536$ — far too large for attention. At 16x16, $N = 256$, which is manageable. This is analogous to the Swin Transformer's window attention strategy (Section 26.5.3), where computational cost is controlled by limiting the attention scope.

Skip connections are crucial: The U-Net's skip connections between encoder and decoder at matching resolutions are essential for preserving fine-grained spatial details. Without them, the bottleneck would discard the high-frequency information needed to produce sharp images. This is the same principle that makes U-Nets effective for image segmentation — the encoder captures "what," and the skip connections preserve "where."


27.7 Latent Diffusion Models

Running diffusion in pixel space is computationally expensive — denoising a 512x512x3 image requires processing 786,432 values at each of the many denoising steps. Latent Diffusion Models (LDMs), introduced by Rombach et al. (2022), solve this by performing diffusion in a compressed latent space.

27.7.1 The Key Insight

Images contain enormous amounts of perceptual redundancy. A high-quality autoencoder can compress a 512x512x3 image into a 64x64x4 latent representation — a 48x reduction in dimensionality — while preserving nearly all perceptually relevant information. Performing diffusion in this latent space is dramatically more efficient.

27.7.2 The Autoencoder

LDMs use a Variational Autoencoder with a KL-regularized or VQ-regularized latent space:

Encoder: $\mathbf{z} = \mathcal{E}(\mathbf{x})$ maps an image $\mathbf{x} \in \mathbb{R}^{H \times W \times 3}$ to a latent $\mathbf{z} \in \mathbb{R}^{h \times w \times c}$, where $h = H/f$, $w = W/f$, and $f$ is the downsampling factor (typically $f = 8$).

Decoder: $\hat{\mathbf{x}} = \mathcal{D}(\mathbf{z})$ reconstructs the image from the latent.

The autoencoder is trained separately from the diffusion model, using a combination of:

  • Reconstruction loss: $L_1$ or $L_2$ pixel loss
  • Perceptual loss: LPIPS (Learned Perceptual Image Patch Similarity) using a pre-trained VGG network
  • Adversarial loss: A patch-based discriminator to encourage sharp reconstructions
  • KL regularization: A small KL penalty to keep the latent space well-structured

27.7.3 Latent Diffusion Training

Once the autoencoder is trained, the diffusion model operates entirely in latent space:

  1. Encode training image: $\mathbf{z}_0 = \mathcal{E}(\mathbf{x})$
  2. Add noise: $\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$
  3. Train the U-Net: $\mathcal{L} = \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t)\|^2$

At inference time, we sample noise in latent space, denoise, and then decode:

  1. $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
  2. Reverse diffusion in latent space: $\mathbf{z}_T \to \mathbf{z}_0$
  3. Decode: $\mathbf{x} = \mathcal{D}(\mathbf{z}_0)$
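
For concreteness, the encode/decode boundary of this pipeline can be sketched with the diffusers AutoencoderKL class (the checkpoint name is one commonly used Stable Diffusion VAE, and 0.18215 is the standard latent scaling factor for SD-style VAEs; both are assumptions of this sketch):

import torch
from diffusers import AutoencoderKL

# One commonly used Stable Diffusion VAE checkpoint (any SD-compatible VAE works).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda").eval()
SCALE = 0.18215  # latent scaling factor used by Stable Diffusion's VAE

@torch.no_grad()
def to_latent(image: torch.Tensor) -> torch.Tensor:
    """[B, 3, 512, 512] image in [-1, 1] -> [B, 4, 64, 64] latent."""
    return vae.encode(image.to("cuda")).latent_dist.sample() * SCALE

@torch.no_grad()
def from_latent(latent: torch.Tensor) -> torch.Tensor:
    """[B, 4, 64, 64] latent -> [B, 3, 512, 512] reconstructed image."""
    return vae.decode(latent / SCALE).sample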

27.7.4 Computational Advantages

The benefits of latent diffusion are substantial:

| Aspect | Pixel Diffusion (512x512) | Latent Diffusion (64x64x4) |
|--------|---------------------------|----------------------------|
| Tensor size | 512x512x3 = 786K values | 64x64x4 = 16K values |
| U-Net FLOPs per step | ~500 GFLOPs | ~30 GFLOPs |
| Training GPU-hours | ~1000 | ~200 |
| VRAM for training | ~40 GB | ~10 GB |

This 16-50x efficiency improvement makes training high-resolution diffusion models feasible on academic-scale hardware.


27.8 The Stable Diffusion Architecture

Stable Diffusion, developed by CompVis, Stability AI, and Runway, is the most widely deployed open-source diffusion model. It is a latent diffusion model with text conditioning, consisting of three main components.

27.8.1 Component 1: The Variational Autoencoder (VAE)

Stable Diffusion's VAE uses a downsampling factor of $f = 8$:

  • Encoder: Takes a 512x512x3 image and produces a 64x64x8 feature map (mean and log-variance for each of 4 latent channels). The latent is sampled as $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$.
  • Decoder: Takes a 64x64x4 latent and reconstructs a 512x512x3 image.

The VAE architecture uses ResNet blocks with self-attention at the bottleneck resolution. The KL weight is very small ($\sim 10^{-6}$), resulting in a nearly deterministic autoencoder with high reconstruction quality (PSNR > 30 dB on typical images).

27.8.2 Component 2: The U-Net with Cross-Attention

The U-Net in Stable Diffusion is augmented with cross-attention layers that enable text conditioning:

Within each U-Net block:

  1. Residual convolutional block with timestep embedding
  2. Self-attention layer (spatial features attend to themselves)
  3. Cross-attention layer (spatial features attend to text embeddings)
  4. Feed-forward network

The cross-attention mechanism works as follows:

  • Queries: Projected from the U-Net feature map, $\mathbf{Q} = \mathbf{W}^Q \mathbf{h}$
  • Keys and Values: Projected from the text encoder output, $\mathbf{K} = \mathbf{W}^K \mathbf{c}$, $\mathbf{V} = \mathbf{W}^V \mathbf{c}$

$$\text{CrossAttention}(\mathbf{h}, \mathbf{c}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

This allows each spatial location in the feature map to "look at" the text prompt and determine what should be generated at that location.

The Stable Diffusion v1.5 U-Net has approximately 860 million parameters.
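
The shape of such a cross-attention layer can be sketched as follows (dimensions and projection names are illustrative, not the exact Stable Diffusion implementation):

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal cross-attention sketch: spatial features attend to text embeddings."""

    def __init__(self, d_model: int, d_text: int, n_heads: int = 8) -> None:
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.to_q = nn.Linear(d_model, d_model, bias=False)
        self.to_k = nn.Linear(d_text, d_model, bias=False)
        self.to_v = nn.Linear(d_text, d_model, bias=False)
        self.to_out = nn.Linear(d_model, d_model)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: [B, HW, d_model] flattened spatial features; c: [B, 77, d_text] text states
        B, N, _ = h.shape
        q = self.to_q(h).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = self.to_k(c).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.to_v(c).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.to_out(out)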

27.8.3 Component 3: The Text Encoder

Stable Diffusion v1.x uses CLIP's text encoder (a 12-layer transformer with 123M parameters) to convert text prompts into conditioning vectors. The text is tokenized using a BPE tokenizer with a vocabulary of 49,408 tokens and a maximum sequence length of 77.

The text encoder produces a sequence of hidden states $\mathbf{c} \in \mathbb{R}^{77 \times 768}$, which serve as the keys and values for cross-attention in the U-Net. Importantly, it is the full sequence of token embeddings (not just the [EOS] pooled output) that conditions generation, allowing fine-grained alignment between words and spatial regions.

Stable Diffusion XL (SDXL) uses two text encoders: CLIP ViT-L/14 and OpenCLIP ViT-bigG/14, concatenating their outputs for a richer text representation of dimension $768 + 1280 = 2048$. The dual text encoder provides complementary representations: CLIP ViT-L captures fine-grained semantic details while ViT-bigG provides stronger compositional understanding.

How text conditions spatial generation: The cross-attention mechanism is the key to understanding how text controls what appears where in the generated image. Each word token in the text produces a key and value vector. Each spatial position in the U-Net's feature map produces a query vector. The attention scores between spatial queries and text keys determine how strongly each spatial location "listens to" each word.

Visualizing these cross-attention maps reveals that spatial locations develop strong affinities for semantically relevant words. For the prompt "a red ball on a blue table," the spatial positions that will become the ball strongly attend to "red" and "ball," while the table region attends to "blue" and "table." This spatial alignment emerges naturally from training and is what enables precise semantic control over generation.

This attention-based text conditioning is the same mechanism used in the cross-attention layers of multimodal models (Chapter 28, Section 28.7.3), demonstrating the universality of the cross-attention pattern for conditioning one modality on another.

27.8.4 The Complete Pipeline

The full text-to-image generation pipeline operates as follows:

  1. Text encoding: Convert the prompt to embeddings $\mathbf{c} = \text{TextEncoder}(\text{prompt})$
  2. Initialize noise: $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ in latent space (64x64x4)
  3. Iterative denoising: For each timestep $t = T, T-1, \ldots, 1$:
     • Predict noise: $\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})$
     • Apply scheduler step: $\mathbf{z}_{t-1} = \text{step}(\mathbf{z}_t, \hat{\boldsymbol{\epsilon}}, t)$
  4. Decode: $\mathbf{x} = \mathcal{D}(\mathbf{z}_0)$

27.9 Classifier-Free Guidance

Classifier-free guidance (Ho & Salimans, 2022) is the single most important technique for controlling diffusion model output quality. Without it, text-conditioned diffusion models produce images that only loosely match the prompt.

27.9.1 The Motivation

During training, the diffusion model learns $p(\mathbf{x} | \mathbf{c})$ — the distribution of images given a text condition. But what we actually want during sampling is to amplify the effect of the conditioning, generating images that are more strongly aligned with the text.

The idea comes from classifier guidance (Dhariwal & Nichol, 2021), which used a separate classifier to steer generation. The gradient of a classifier's log-probability $\nabla_{\mathbf{x}} \log p(c | \mathbf{x})$ was added to the diffusion model's score estimate, effectively pushing generation toward images that a classifier would confidently label as class $c$.

27.9.2 The Classifier-Free Approach

Classifier-free guidance eliminates the need for a separate classifier. During training, the text condition $\mathbf{c}$ is randomly dropped (replaced with an empty/null embedding $\varnothing$) with some probability (typically 10-20%). This trains the model for both conditional and unconditional generation:

  • Conditional prediction: $\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})$
  • Unconditional prediction: $\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing)$

During sampling, the guided prediction is a linear extrapolation:

$$\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing) + w \cdot (\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}) - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing))$$

where $w$ is the guidance scale:

  • $w = 1$: Standard conditional sampling (no guidance)
  • $w > 1$: Amplified conditioning — images more closely match the text
  • $w = 7.5$: Typical value for Stable Diffusion
  • $w = 0$: Unconditional generation

27.9.3 The Quality-Diversity Tradeoff

Increasing the guidance scale improves text-image alignment and perceived quality but reduces diversity. This is analogous to the temperature parameter in language models:

| Guidance Scale | Effect |
|----------------|--------|
| 1.0 | Maximum diversity, weak text alignment |
| 3.0-5.0 | Good balance for artistic generation |
| 7.0-8.5 | Strong text alignment, standard for most use cases |
| 10.0-15.0 | Very strong alignment, may produce artifacts |
| 20.0+ | Oversaturated, distorted outputs |

27.9.4 Mathematical Basis for Classifier-Free Guidance

To understand why classifier-free guidance works, we can derive it from the score-based perspective. The score of the conditional distribution can be decomposed using Bayes' theorem:

$$\nabla_{\mathbf{x}} \log p(\mathbf{x} | \mathbf{c}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) + \nabla_{\mathbf{x}} \log p(\mathbf{c} | \mathbf{x})$$

Classifier guidance amplifies the second term with a scale factor $w$:

$$\tilde{\nabla}_{\mathbf{x}} \log p(\mathbf{x} | \mathbf{c}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) + w \cdot \nabla_{\mathbf{x}} \log p(\mathbf{c} | \mathbf{x})$$

Rearranging:

$$\tilde{\nabla}_{\mathbf{x}} \log p(\mathbf{x} | \mathbf{c}) = (1 - w) \nabla_{\mathbf{x}} \log p(\mathbf{x}) + w \cdot \nabla_{\mathbf{x}} \log p(\mathbf{x} | \mathbf{c})$$

Since the noise prediction $\boldsymbol{\epsilon}_\theta$ is proportional to the negative score, this translates directly to the classifier-free guidance formula. The "classifier-free" aspect is that we never need an explicit classifier — instead, we estimate $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ using the unconditional model and $\nabla_{\mathbf{x}} \log p(\mathbf{x} | \mathbf{c})$ using the conditional model, both of which are learned by the same network through conditional dropout.

Worked Example: Consider a Stable Diffusion model with guidance scale $w = 7.5$ generating "a castle on a hilltop." At each denoising step:

  1. Run the U-Net with the text condition: $\boldsymbol{\epsilon}_c = \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \text{"a castle on a hilltop"})$
  2. Run the U-Net without condition: $\boldsymbol{\epsilon}_u = \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \varnothing)$
  3. Compute the guided noise: $\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_u + 7.5 \times (\boldsymbol{\epsilon}_c - \boldsymbol{\epsilon}_u)$

The difference $\boldsymbol{\epsilon}_c - \boldsymbol{\epsilon}_u$ captures the "direction" that makes the image more castle-like. Multiplying by 7.5 amplifies this direction, producing images that more strongly match the prompt at the cost of reduced diversity.

27.9.5 Implementation Cost

Classifier-free guidance requires two forward passes through the U-Net at each denoising step — one conditional and one unconditional. This doubles the computational cost of sampling. In practice, the two predictions can be batched together for better GPU utilization, but the total FLOPs still double.
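
In code, the batched evaluation looks roughly like the following sketch (the U-Net call signature, taking latents, timesteps, and text hidden states, is an assumption for illustration):

import torch

@torch.no_grad()
def guided_noise(unet, z_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance with the two predictions batched together.

    unet(z, t, text_states) is assumed to return the noise prediction; cond_emb
    and uncond_emb are the text embeddings for the prompt and the empty (or
    negative) prompt. Sketch only.
    """
    z_in = torch.cat([z_t, z_t], dim=0)                # duplicate the latents
    emb_in = torch.cat([uncond_emb, cond_emb], dim=0)  # [unconditional; conditional]
    t_in = torch.cat([t, t], dim=0)
    eps_uncond, eps_cond = unet(z_in, t_in, emb_in).chunk(2, dim=0)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)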

27.9.6 Negative Prompts

Negative prompts are a practical extension of classifier-free guidance. Instead of using an empty embedding $\varnothing$ for the unconditional prediction, we use a "negative prompt" that describes what we do not want to see:

$$\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}_{\text{neg}}) + w \cdot (\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}_{\text{pos}}) - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}_{\text{neg}}))$$

Common negative prompts include "blurry, low quality, deformed, ugly" to steer the model away from low-quality outputs. This adds no additional computational cost since it simply replaces the unconditional prediction.


27.10 ControlNet and Controlled Generation

While text prompts provide semantic control over generation, they offer limited control over spatial structure. ControlNet (Zhang & Agrawala, 2023) addresses this by enabling precise spatial conditioning using additional input modalities.

27.10.1 The Control Problem

Consider generating "a photograph of a house." A text prompt specifies what to generate but not where or how. For practical applications — architectural visualization, pose-guided character generation, or edge-to-image translation — we need spatial control signals such as edge maps, depth maps, pose skeletons, or segmentation maps.

27.10.2 ControlNet Architecture

ControlNet adds spatial conditioning to a pre-trained Stable Diffusion model without modifying the original weights. The architecture consists of:

  1. Locked copy: The original pre-trained U-Net encoder, whose weights are frozen.
  2. Trainable copy: A copy of the U-Net encoder that processes the control signal.
  3. Zero convolutions: 1x1 convolutional layers initialized to zero that connect the trainable copy's outputs to the locked model's skip connections.

The zero-initialization is critical: at the beginning of training, the ControlNet adds nothing to the pre-trained model (because the zero convolutions output zeros), preserving the model's original capabilities. As training progresses, the zero convolutions learn non-zero weights, gradually incorporating the control signal.
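
A zero convolution is simple to express; a minimal sketch:

import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, connecting the control branch to the frozen U-Net.

    At initialization it outputs zeros, so the control branch has no effect until trained.
    """
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv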

27.10.3 Control Modalities

ControlNet has been trained for various control signals:

  • Canny edges: Precise edge-based control for structural guidance
  • Depth maps: Control the 3D structure and scene layout
  • OpenPose skeletons: Control human body poses
  • Semantic segmentation: Assign specific regions to specific categories
  • Normal maps: Control surface orientation for detailed 3D structure
  • M-LSD lines: Control architectural lines and vanishing points
  • Scribble/sketch: Loose artistic guidance from rough drawings

27.10.4 Multi-ControlNet

Multiple ControlNet models can be combined to apply several control signals simultaneously. For example, using both a depth map and a pose skeleton to generate a person with specific positioning in a specific environment. The outputs from multiple ControlNets are simply added together before being injected into the main U-Net.

27.10.5 IP-Adapter and Image Conditioning

Beyond structural control, IP-Adapter (Ye et al., 2023) enables image-based style and content conditioning by projecting CLIP image embeddings into the cross-attention layers of the U-Net, allowing generation to be guided by reference images rather than (or in addition to) text prompts.


27.11 Advanced Topics in Diffusion Models

27.11.1 Image-to-Image Translation

Diffusion models naturally support image-to-image translation by starting the reverse process from a partially noised version of an input image rather than pure noise:

  1. Encode the input image: $\mathbf{z}_0 = \mathcal{E}(\mathbf{x}_{\text{input}})$
  2. Add noise to timestep $t_{\text{start}} < T$: $\mathbf{z}_{t_{\text{start}}} = \sqrt{\bar{\alpha}_{t_{\text{start}}}}\,\mathbf{z}_0 + \sqrt{1 - \bar{\alpha}_{t_{\text{start}}}}\,\boldsymbol{\epsilon}$
  3. Denoise from $t_{\text{start}}$ to 0 with the new text prompt

The parameter $t_{\text{start}}$ (controlled by a "strength" parameter between 0 and 1) determines how much of the input image is preserved. Higher strength means more noise and more creative freedom; lower strength preserves more of the original structure.

27.11.2 Inpainting

Inpainting fills in masked regions of an image while keeping the unmasked areas intact. Specialized inpainting models take the masked image and mask as additional inputs to the U-Net. During sampling, the unmasked region is replaced with the noised original image at each step, forcing the model to generate content that seamlessly blends with the surrounding context.
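
The per-step blending described above can be sketched in a few lines (tensor names and the mask convention, 1 for regions to generate and 0 for regions to keep, are assumptions for illustration):

import torch

def blend_known_region(z_denoised, z0_original, mask, alpha_bar_t):
    """Re-impose the noised original outside the mask after each denoising step.

    alpha_bar_t is a scalar tensor for the current timestep; mask broadcasts over
    the latent channels. Sketch of the blending idea, not a full pipeline.
    """
    noise = torch.randn_like(z0_original)
    z_known = alpha_bar_t.sqrt() * z0_original + (1 - alpha_bar_t).sqrt() * noise
    return mask * z_denoised + (1 - mask) * z_known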

27.11.3 Upscaling and Super-Resolution

Diffusion models excel at super-resolution — generating high-frequency details that are consistent with a low-resolution input. Models like Stable Diffusion Upscaler take a low-resolution image, add noise, and denoise it at a higher resolution, adding realistic fine details guided by the text prompt.

27.11.4 Practical Generation Pipeline with Diffusers

The HuggingFace diffusers library provides a high-level interface for working with diffusion models. Here is a comprehensive example covering text-to-image generation, image-to-image translation, and inpainting:

import torch
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
    DPMSolverMultistepScheduler,
)
from PIL import Image


def text_to_image(
    prompt: str,
    negative_prompt: str = "blurry, low quality",
    model_id: str = "stabilityai/stable-diffusion-2-1",
    num_steps: int = 25,
    guidance_scale: float = 7.5,
    height: int = 512,
    width: int = 512,
    seed: int = 42,
) -> Image.Image:
    """Generate an image from a text prompt.

    Args:
        prompt: Text description of desired image.
        negative_prompt: Text describing undesired qualities.
        model_id: HuggingFace model identifier.
        num_steps: Number of denoising steps.
        guidance_scale: Classifier-free guidance scale.
        height: Output image height.
        width: Output image width.
        seed: Random seed for reproducibility.

    Returns:
        Generated PIL Image.
    """
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    # Use DPM-Solver for fast, high-quality sampling
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config
    )
    pipe = pipe.to("cuda")

    generator = torch.Generator("cuda").manual_seed(seed)

    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=num_steps,
        guidance_scale=guidance_scale,
        height=height,
        width=width,
        generator=generator,
    )
    return result.images[0]


def image_to_image(
    init_image: Image.Image,
    prompt: str,
    strength: float = 0.75,
    model_id: str = "stabilityai/stable-diffusion-2-1",
    num_steps: int = 25,
    guidance_scale: float = 7.5,
) -> Image.Image:
    """Transform an existing image using a text prompt.

    Args:
        init_image: The input PIL Image to transform.
        prompt: Text describing the desired transformation.
        strength: How much to transform (0=no change, 1=full).
        model_id: HuggingFace model identifier.
        num_steps: Number of denoising steps.
        guidance_scale: Classifier-free guidance scale.

    Returns:
        Transformed PIL Image.
    """
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    result = pipe(
        prompt=prompt,
        image=init_image,
        strength=strength,
        num_inference_steps=num_steps,
        guidance_scale=guidance_scale,
    )
    return result.images[0]
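

The imported StableDiffusionInpaintPipeline is used analogously. A sketch, assuming an inpainting checkpoint and a PIL mask in which white pixels mark the region to regenerate (it reuses the imports above):

def inpaint(
    init_image: Image.Image,
    mask_image: Image.Image,
    prompt: str,
    model_id: str = "stabilityai/stable-diffusion-2-inpainting",
    num_steps: int = 25,
    guidance_scale: float = 7.5,
) -> Image.Image:
    """Fill the masked region of an image according to a text prompt.

    Args:
        init_image: The input PIL Image.
        mask_image: PIL mask; white pixels are regenerated, black pixels are kept.
        prompt: Text describing the desired content for the masked region.
        model_id: HuggingFace model identifier (an inpainting checkpoint).
        num_steps: Number of denoising steps.
        guidance_scale: Classifier-free guidance scale.

    Returns:
        Inpainted PIL Image.
    """
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    result = pipe(
        prompt=prompt,
        image=init_image,
        mask_image=mask_image,
        num_inference_steps=num_steps,
        guidance_scale=guidance_scale,
    )
    return result.images[0]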

Memory optimization tips: For consumer GPUs with limited VRAM, several techniques can reduce memory usage dramatically:

# Enable attention slicing (trades speed for memory)
pipe.enable_attention_slicing()

# Enable VAE slicing for large batch sizes
pipe.enable_vae_slicing()

# Use sequential CPU offloading (slowest but uses least GPU memory)
pipe.enable_sequential_cpu_offload()

# Use model CPU offloading (faster than sequential)
pipe.enable_model_cpu_offload()

# Use xFormers memory-efficient attention (if installed)
pipe.enable_xformers_memory_efficient_attention()

With these optimizations, Stable Diffusion can run on GPUs with as little as 4 GB VRAM, making it accessible to a wide audience.

27.11.5 Distillation and Few-Step Models

The quest for faster generation has led to several distillation approaches:

  • Latent Consistency Models (LCM): Distill the full diffusion trajectory into a model that generates high-quality images in 4-8 steps.
  • SDXL-Turbo and SD-Turbo: Use adversarial training combined with diffusion to enable 1-4 step generation.
  • Lightning and Hyper-SD: Apply progressive distillation for high-quality fast generation.

These models maintain remarkable quality despite the dramatic reduction in sampling steps.

27.11.6 Video Diffusion

Extending diffusion models to video generation requires modeling temporal consistency. Architectures like Stable Video Diffusion add temporal attention layers (attention across frames) to the U-Net, interleaved with the existing spatial attention layers. The forward process adds noise independently to each frame, and the temporal attention layers learn to generate temporally coherent sequences.


27.12 Training Diffusion Models: Practical Considerations

27.12.1 Dataset Preparation

Training data quality is paramount for diffusion models:

  • Resolution: Images should be resized and center-cropped (or randomly cropped during training) to the target resolution. Aspect ratio bucketing groups images by aspect ratio to avoid distortion.
  • Captions: For text-conditioned models, each image needs a text description. BLIP or CogVLM can generate synthetic captions for unlabeled datasets.
  • Filtering: Remove low-quality images (blurry, watermarked, too small) and potentially harmful content.

27.12.2 Training Hyperparameters

Typical training configurations for latent diffusion:

  • Optimizer: AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay = 0.01
  • Learning rate: $10^{-4}$ to $10^{-5}$, often with linear warmup and cosine decay
  • Batch size: As large as memory permits (256-2048 with gradient accumulation)
  • EMA: Exponential moving average of weights with decay 0.9999
  • Mixed precision: FP16 or BF16 for memory efficiency
  • Gradient clipping: Clip to max norm 1.0
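
A minimal PyTorch sketch of this optimizer setup; the warmup and total step counts are illustrative assumptions, and `unet` stands in for the actual denoising network:

import math
import torch
from torch.optim.lr_scheduler import LambdaLR

unet = torch.nn.Linear(4, 4)  # dummy stand-in for the real U-Net

optimizer = torch.optim.AdamW(
    unet.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)

warmup_steps, total_steps = 1_000, 100_000  # assumed schedule lengths

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                              # linear warmup
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))    # cosine decay

scheduler = LambdaLR(optimizer, lr_lambda)

# EMA of the weights with decay 0.9999, updated after each optimizer step
ema_unet = torch.optim.swa_utils.AveragedModel(
    unet, avg_fn=lambda avg, new, n: 0.9999 * avg + 0.0001 * new
)

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(unet.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); ema_unet.update_parameters(unet)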

27.12.3 Fine-Tuning Pre-Trained Models

Several techniques enable efficient fine-tuning of pre-trained diffusion models:

  • DreamBooth: Fine-tunes the entire model (or LoRA adapters) on a small set of images (3-5) of a specific subject, binding it to a unique identifier token.
  • LoRA (Low-Rank Adaptation): Adds trainable low-rank matrices to the cross-attention and self-attention layers, requiring only 1-10MB of additional parameters.
  • Textual Inversion: Learns a new embedding vector for a concept while keeping the model frozen. In its basic form only a single text-embedding vector is learned (768-dimensional for Stable Diffusion v1), making this the most parameter-efficient approach but also the least expressive.

LoRA in detail: LoRA (discussed in more general terms in earlier chapters) is particularly well-suited to diffusion models. For each attention weight matrix $\mathbf{W} \in \mathbb{R}^{d \times d}$, LoRA adds a low-rank update $\Delta\mathbf{W} = \mathbf{B}\mathbf{A}$, where $\mathbf{B} \in \mathbb{R}^{d \times r}$ and $\mathbf{A} \in \mathbb{R}^{r \times d}$ with rank $r \ll d$ (typically $r = 4$ to $r = 64$). For Stable Diffusion, whose attention layers have hidden dimensions $d = 320$ to $d = 1280$, a rank-4 LoRA adds roughly 1MB of trainable parameters to the 860M-parameter U-Net, about a 0.1% increase. Yet this is sufficient to adapt the model's style, subject matter, or composition preferences.
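
A minimal sketch of the low-rank update applied to a single attention projection (not the diffusers implementation; the initialization scale is an assumption):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scale * B(A x); since B = 0 at init, behavior starts unchanged
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Example: wrap a 320-dimensional attention projection with a rank-4 adapter
proj = LoRALinear(nn.Linear(320, 320), rank=4)
y = proj(torch.randn(2, 77, 320))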

DreamBooth vs. LoRA: DreamBooth fine-tunes the full model (or a LoRA adapter) on 3-5 images of a specific subject, binding the subject to a unique identifier token (e.g., "[V] dog" for your specific dog). The training objective combines the standard diffusion loss with a prior preservation loss that prevents the model from "forgetting" what a generic dog looks like. Typical training takes 800-1200 steps on a single GPU with a learning rate of $10^{-6}$ for full fine-tuning or $10^{-4}$ for LoRA.
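
The combined objective can be sketched as a two-term loss; the tensor names and the default prior weight of 1.0 are assumptions for illustration:

import torch
import torch.nn.functional as F

def dreambooth_loss(
    noise_pred: torch.Tensor,   # U-Net prediction on subject-image latents
    noise: torch.Tensor,        # noise actually added to subject latents
    prior_pred: torch.Tensor,   # prediction on generated "class" images
    prior_noise: torch.Tensor,  # noise added to class-image latents
    prior_weight: float = 1.0,  # assumed default weighting
) -> torch.Tensor:
    # Standard diffusion loss on the subject plus the prior-preservation term
    subject_loss = F.mse_loss(noise_pred, noise)
    prior_loss = F.mse_loss(prior_pred, prior_noise)
    return subject_loss + prior_weight * prior_loss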

27.12.4 Evaluation Metrics

Evaluating generative models requires specialized metrics:

  • FID (Fréchet Inception Distance): Measures the distance between the distributions of generated and real images in Inception feature space. Lower is better. The standard benchmark computes FID on 50K generated samples against the training set.

The FID is computed as:

$$\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 + \text{Tr}(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g)^{1/2})$$

where $(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$ and $(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ are the mean and covariance of Inception-v3 features for real and generated images, respectively.

Worked Example: If the real image features have mean $\boldsymbol{\mu}_r = [1.0, 2.0]$ and the generated features have mean $\boldsymbol{\mu}_g = [1.1, 2.2]$, the first term is $0.1^2 + 0.2^2 = 0.01 + 0.04 = 0.05$. The covariance terms capture how well the model reproduces the diversity of the data. An FID of 0 indicates that the two feature distributions have identical means and covariances.

Typical FID values:

  • State-of-the-art unconditional generation (ImageNet 256x256): ~2-5
  • Stable Diffusion text-to-image (COCO): ~10-15
  • Amateur GAN: ~50-100

A small sketch of the FID computation from pre-extracted features appears after the metrics list below.

  • CLIP Score: Measures text-image alignment using CLIP embeddings (see Chapter 28). Higher is better.
  • IS (Inception Score): Measures both quality and diversity. Higher is better but has known limitations — it can be fooled by mode dropping (generating one perfect image per class).
  • Human evaluation: Still the gold standard; Amazon Mechanical Turk studies compare methods via pairwise preferences. Common protocols include side-by-side comparisons ("which image better matches the prompt?") and absolute quality ratings.
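
Given pre-extracted Inception-v3 features for real and generated images, the FID formula above can be computed directly. A minimal NumPy/SciPy sketch follows; feature extraction itself is omitted, and in practice libraries such as torchmetrics or clean-fid wrap the full pipeline:

import numpy as np
from scipy import linalg

def fid_from_features(real: np.ndarray, fake: np.ndarray) -> float:
    """FID between two feature sets, each of shape (num_images, feature_dim)."""
    mu_r, mu_g = real.mean(axis=0), fake.mean(axis=0)
    sigma_r = np.cov(real, rowvar=False)
    sigma_g = np.cov(fake, rowvar=False)
    # Matrix square root of the covariance product
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(
        np.sum((mu_r - mu_g) ** 2)
        + np.trace(sigma_r + sigma_g - 2.0 * covmean)
    )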

27.13 Ethical Considerations and Limitations

27.13.1 Copyright and Training Data

Diffusion models are trained on large-scale web-scraped datasets (LAION-5B for Stable Diffusion), raising questions about the use of copyrighted images for training. Ongoing legal cases (e.g., Getty Images v. Stability AI) are shaping the legal landscape.

27.13.2 Deepfakes and Misinformation

The ability to generate photorealistic images poses risks for misinformation. Watermarking techniques (visible and invisible) and AI-generated content detection models are being developed as countermeasures.

27.13.3 Bias and Representation

Training data biases propagate to generated content. Models may over-represent certain demographics, reinforce stereotypes, or produce biased associations. Safety filters and careful dataset curation are partial mitigations.

27.13.4 Environmental Impact

Training large diffusion models requires significant computational resources. Stable Diffusion v1 required approximately 150,000 A100-hours of compute. Latent diffusion provides substantial savings compared to pixel-space diffusion, and efficient fine-tuning methods (LoRA, DreamBooth) democratize customization.

27.13.5 Safety Measures

Responsible deployment of diffusion models requires multiple safety layers:

  1. NSFW classifiers: Stable Diffusion includes a safety checker that flags potentially inappropriate generated content by comparing CLIP embeddings against known NSFW concept embeddings.

  2. Prompt filtering: Block generation requests containing harmful keywords or phrases through both static keyword lists and learned classifiers.

  3. Invisible watermarking: Embed imperceptible watermarks in generated images to enable later identification as AI-generated. Methods include modifying the initial noise distribution in a detectable way or adding watermarks in the latent space before VAE decoding.

  4. Model cards and usage policies: Document the model's capabilities, limitations, and intended use cases. The Stable Diffusion model card explicitly prohibits generating content that exploits or harms individuals.

These measures are imperfect — determined actors can bypass them — but they represent important steps toward responsible AI-generated content. The balance between accessibility and safety remains an active area of debate in the AI community.


27.14 The Diffusion Transformer (DiT) Architecture

While U-Nets have been the standard backbone for diffusion models, the Diffusion Transformer (DiT) architecture (Peebles & Xie, 2023) replaces the U-Net entirely with a transformer, following the broader trend of transformer adoption across domains, as we saw in Chapter 26 for vision.

27.14.1 Architecture Design

DiT operates on sequences of latent patches (similar to ViT operating on image patches). The architecture consists of:

  1. Patch embedding: Divide the latent $\mathbf{z} \in \mathbb{R}^{h \times w \times c}$ into patches and linearly embed them.
  2. Transformer blocks with adaptive layer norm (adaLN-Zero): Standard transformer blocks where the scale and shift parameters of layer normalization are predicted from the timestep and class conditioning, rather than being fixed learned parameters.
  3. Final linear layer: Predicts the noise and diagonal covariance for each patch.

The adaLN-Zero conditioning mechanism works as follows:

$$\text{adaLN}(\mathbf{h}, \mathbf{c}) = \gamma(\mathbf{c}) \odot \text{LN}(\mathbf{h}) + \beta(\mathbf{c})$$

where $\gamma(\mathbf{c})$ and $\beta(\mathbf{c})$ are predicted from the conditioning vector $\mathbf{c}$ (which encodes both timestep and class information). In addition to the scale and shift, adaLN-Zero predicts a per-block gating scale $\alpha(\mathbf{c})$ applied to the output of each residual branch. The "Zero" refers to initializing the modulation network so that $\alpha = 0$, similar to ControlNet's zero-convolution strategy: every block starts as the identity function, which stabilizes early training.
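
A simplified sketch of a DiT-style block with adaLN-Zero conditioning; the dimensions, head count, and MLP ratio are illustrative assumptions, not the published configuration:

import torch
import torch.nn as nn

class DiTBlockSketch(nn.Module):
    """Simplified DiT block: adaLN-Zero modulated attention and MLP."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # One modulation layer predicts shift, scale, and gate for both sub-blocks
        self.adaLN = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.adaLN.weight)  # "Zero": gates start at 0,
        nn.init.zeros_(self.adaLN.bias)    # so each block is the identity at init

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); c: (batch, dim) timestep + class embedding
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaLN(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x

block = DiTBlockSketch(dim=384)
tokens = torch.randn(2, 256, 384)  # 16x16 grid of latent patches
cond = torch.randn(2, 384)         # combined timestep + class conditioning
out = block(tokens, cond)          # same shape as the input tokens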

27.14.2 Scaling Properties

DiT demonstrates clear scaling laws: larger models and more training compute consistently produce lower FID scores. At publication, DiT-XL/2 (675M parameters) achieved state-of-the-art FID on class-conditional ImageNet 256x256, surpassing prior U-Net-based diffusion models. This finding has motivated the adoption of transformer backbones in subsequent large-scale systems, most notably OpenAI's Sora for video generation (discussed in Chapter 30).

27.15 Exercises

  1. Forward process visualization: Implement the DDPM forward process for a single image from CIFAR-10. Visualize the image at timesteps $t \in \{0, 100, 250, 500, 750, 1000\}$ using both the linear and cosine schedules. At which timestep does each schedule reach a visually indistinguishable-from-noise state?

  2. Loss function comparison: Implement both $\boldsymbol{\epsilon}$-prediction and $\mathbf{x}_0$-prediction training objectives. Train a small U-Net on MNIST for 50 epochs with each objective and compare the generated sample quality (visually and via FID if available).

  3. Guidance scale sweep: Using the diffusers library, generate 10 images of "a golden retriever playing in snow" with guidance scales $w \in \{1.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0\}$. Document the quality-diversity tradeoff at each scale.

  4. LoRA fine-tuning: Fine-tune a Stable Diffusion model using LoRA on 5 images of a specific subject (your pet, a specific building, etc.) with DreamBooth. Experiment with LoRA rank values of 4, 8, and 16 and compare the quality of subject preservation versus creative generation.

  5. Scheduler comparison: Generate the same image (same seed, same prompt) using DDPM (1000 steps), DDIM (50 steps), DPM-Solver++ (20 steps), and Euler (25 steps). Compare the outputs and generation times.

27.16 Summary

Diffusion models represent a paradigm shift in generative modeling, combining the mathematical elegance of the forward-reverse diffusion framework with the practical power of modern neural network architectures. The key ideas are:

  1. The forward process gradually destroys data by adding noise according to a schedule, eventually reaching a simple Gaussian distribution.
  2. The reverse process learns to denoise step by step, with a neural network predicting the noise at each timestep.
  3. The training objective is remarkably simple: predict what noise was added. This is equivalent to score matching in the continuous-time limit.
  4. DDIM and fast samplers reduce the number of denoising steps from thousands to tens or even single steps.
  5. Latent diffusion dramatically improves efficiency by performing diffusion in a compressed latent space.
  6. Classifier-free guidance enables controllable generation by trading diversity for text-image alignment.
  7. ControlNet provides precise spatial control through additional conditioning signals.

The Stable Diffusion pipeline combines a VAE, a conditioned U-Net, and a CLIP text encoder into a system that can generate high-resolution images from text descriptions. With the techniques covered in this chapter — including fine-tuning methods like LoRA and DreamBooth — you have the knowledge to build, customize, and deploy diffusion-based generation systems.


References

  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
  • Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.
  • Song, Y., Sohl-Dickstein, J., et al. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.
  • Rombach, R., Blattmann, A., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
  • Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS Workshop.
  • Nichol, A. & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021.
  • Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021.
  • Zhang, L. & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023.
  • Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.
  • Sohl-Dickstein, J., Weiss, E., et al. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015.
  • Kingma, D. P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014.
  • Goodfellow, I. J., et al. (2014). Generative Adversarial Nets. NeurIPS 2014.