40 min read

from undercomplete and sparse variants to denoising autoencoders, variational autoencoders with full ELBO derivation, and the modern self-supervised and contrastive learning paradigm."

In This Chapter

16.1 The Autoencoder Framework
16.2 Undercomplete Autoencoders
16.3 Sparse Autoencoders
16.4 Denoising Autoencoders
16.5 Contractive Autoencoders
16.6 Convolutional Autoencoders
16.7 Variational Autoencoders (VAEs)
16.8 Beyond Vanilla VAEs
16.9 Contrastive Learning
16.10 The Self-Supervised Learning Landscape
16.11 Practical Considerations
16.12 Summary
References

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

from undercomplete and sparse variants to denoising autoencoders, variational autoencoders with full ELBO derivation, and the modern self-supervised and contrastive learning paradigm." prerequisites: - "Chapter 11: Neural Networks from Scratch (forward/backward pass, gradient descent)" - "Chapter 12: Training Deep Networks (optimizers, learning rate schedules, batch normalization)" - "Chapter 13: Regularization and Generalization (dropout, weight decay, capacity control)" - "Chapter 14: Convolutional Neural Networks (convolutions, transposed convolutions, feature maps)" - "Chapter 4: Probability, Statistics, and Information Theory (KL divergence, Bayes' theorem, entropy)" learning_objectives: - "Understand the autoencoder framework as a learned compression and reconstruction pipeline" - "Implement undercomplete, sparse, and denoising autoencoders in PyTorch" - "Derive the Evidence Lower Bound (ELBO) and explain why it enables training of latent variable models" - "Implement the reparameterization trick and train a Variational Autoencoder end-to-end" - "Visualize and interpolate in latent spaces to build intuition for learned representations" - "Explain the contrastive learning paradigm and implement a simplified SimCLR pipeline" - "Distinguish between self-supervised pretext tasks and downstream evaluation" key_terms: - autoencoder - encoder - decoder - latent space - bottleneck - undercomplete autoencoder - sparse autoencoder - denoising autoencoder - variational autoencoder - evidence lower bound (ELBO) - reparameterization trick - reconstruction loss - KL divergence - posterior collapse - contrastive learning - SimCLR - BYOL - self-supervised learning - pretext task - representation learning estimated_time: "4-5 hours" difficulty: "Intermediate to Advanced"

Chapter 16: Autoencoders and Representation Learning

"What I cannot create, I do not understand." --- Richard Feynman

In Chapter 7, we explored unsupervised learning through classical methods like PCA, K-means, and DBSCAN. These algorithms discover structure in data without labels, but they are limited in the complexity of patterns they can capture. PCA finds linear subspaces. K-means finds spherical clusters. What if the underlying structure of the data is deeply nonlinear, hierarchical, and high-dimensional---as it is for images, speech, and text?

Autoencoders answer this question by learning to compress data through a neural network bottleneck and then reconstruct it. The bottleneck forces the network to discover a compact, informative representation---a latent code---that captures the essential factors of variation in the data. This simple idea, when extended with probabilistic reasoning (Variational Autoencoders), regularization (sparse and denoising variants), and modern training paradigms (contrastive and self-supervised learning), has become one of the pillars of modern representation learning.

In this chapter, we build the entire autoencoder family from the ground up. We start with the simplest undercomplete autoencoder, add regularization through sparsity and noise, derive the full mathematical framework of Variational Autoencoders, and then explore the self-supervised learning paradigm that has revolutionized how we pretrain deep networks. By the end, you will understand not just how these models work, but why they work---and you will have the tools to implement them all.

16.1 The Autoencoder Framework

16.1.1 Motivation: Why Learn Representations?

A representation of data is a transformation that makes subsequent tasks easier. When you standardize features before running logistic regression (Chapter 6), you are choosing a representation. When PCA projects data onto its principal components (Chapter 7), it learns a linear representation. The question is: can we learn a nonlinear representation that is optimized for the data itself?

Consider a dataset of face images, each $64 \times 64$ pixels. The raw representation is a 4,096-dimensional vector. But the space of realistic faces is far smaller than the space of all possible 4,096-dimensional vectors. Faces vary along a handful of interpretable factors: pose, lighting, expression, identity. If we could discover these factors automatically, we would have a representation that is both compact and meaningful.

This is exactly what autoencoders do.

16.1.2 Architecture Overview

An autoencoder consists of two neural networks:

Encoder $f_\phi$: Maps the input $\mathbf{x} \in \mathbb{R}^d$ to a latent representation $\mathbf{z} \in \mathbb{R}^k$ where $k$ is the latent dimension.
Decoder $g_\theta$: Maps the latent representation $\mathbf{z}$ back to a reconstruction $\hat{\mathbf{x}} \in \mathbb{R}^d$.

The full pipeline is:

$$\mathbf{x} \xrightarrow{f_\phi} \mathbf{z} \xrightarrow{g_\theta} \hat{\mathbf{x}}$$

Training minimizes a reconstruction loss:

$$\mathcal{L}(\phi, \theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\mathbf{x}_i, g_\theta(f_\phi(\mathbf{x}_i)))$$

where $\ell$ is typically the mean squared error (MSE) for continuous data or binary cross-entropy (BCE) for binary/image data:

$$\ell_{\text{MSE}}(\mathbf{x}, \hat{\mathbf{x}}) = \|\mathbf{x} - \hat{\mathbf{x}}\|^2$$

$$\ell_{\text{BCE}}(\mathbf{x}, \hat{\mathbf{x}}) = -\sum_{j=1}^{d} \left[ x_j \log \hat{x}_j + (1 - x_j) \log(1 - \hat{x}_j) \right]$$

The critical question is: what prevents the autoencoder from simply learning the identity function? If the network has enough capacity, it could map every input to itself through the latent space without learning anything useful. The answer lies in the bottleneck and various forms of regularization.

16.1.3 PyTorch Implementation of a Basic Autoencoder

Before we discuss theory further, let us build a concrete autoencoder in PyTorch. Seeing the code alongside the mathematics will ground your understanding.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)


class SimpleAutoencoder(nn.Module):
    """A basic fully connected autoencoder.

    Args:
        input_dim: Dimensionality of the input (e.g., 784 for MNIST).
        hidden_dims: List of hidden layer sizes for the encoder.
            The decoder mirrors this in reverse.
        latent_dim: Dimensionality of the latent bottleneck.
    """

    def __init__(
        self,
        input_dim: int = 784,
        hidden_dims: list[int] = [256, 128],
        latent_dim: int = 32,
    ) -> None:
        super().__init__()
        # Build encoder layers
        encoder_layers = []
        prev_dim = input_dim
        for h_dim in hidden_dims:
            encoder_layers.append(nn.Linear(prev_dim, h_dim))
            encoder_layers.append(nn.ReLU())
            prev_dim = h_dim
        encoder_layers.append(nn.Linear(prev_dim, latent_dim))
        self.encoder = nn.Sequential(*encoder_layers)

        # Build decoder layers (mirror of encoder)
        decoder_layers = []
        prev_dim = latent_dim
        for h_dim in reversed(hidden_dims):
            decoder_layers.append(nn.Linear(prev_dim, h_dim))
            decoder_layers.append(nn.ReLU())
            prev_dim = h_dim
        decoder_layers.append(nn.Linear(prev_dim, input_dim))
        decoder_layers.append(nn.Sigmoid())  # Output in [0, 1]
        self.decoder = nn.Sequential(*decoder_layers)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encode(x)
        x_hat = self.decode(z)
        return x_hat


# --- Demo ---
model = SimpleAutoencoder(input_dim=784, latent_dim=32)
x = torch.randn(16, 784).sigmoid()  # Simulated batch of images
x_hat = model(x)
loss = F.mse_loss(x_hat, x)
print(f"Reconstruction loss: {loss.item():.4f}")
print(f"Encoder parameters: {sum(p.numel() for p in model.encoder.parameters()):,}")
print(f"Decoder parameters: {sum(p.numel() for p in model.decoder.parameters()):,}")

Notice the symmetry: the encoder compresses 784 dimensions down through 256 and 128 to the 32-dimensional bottleneck, and the decoder reverses that path. The Sigmoid at the decoder output ensures reconstructions lie in $[0, 1]$, matching the range of normalized pixel values.

16.1.4 Relationship to PCA

When the encoder and decoder are both single linear layers with no activation functions, and the loss is MSE, the autoencoder learns the same subspace as PCA. Specifically, the encoder's weight matrix spans the same subspace as the top-$k$ principal components (though the basis vectors may differ by a rotation).

To see this formally, consider a linear autoencoder with encoder weights $\mathbf{W}_e \in \mathbb{R}^{k \times d}$ and decoder weights $\mathbf{W}_d \in \mathbb{R}^{d \times k}$. The reconstruction is $\hat{\mathbf{x}} = \mathbf{W}_d \mathbf{W}_e \mathbf{x}$. The MSE loss becomes:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{W}_d \mathbf{W}_e \mathbf{x}_i\|^2$$

The optimal solution requires $\mathbf{W}_d \mathbf{W}_e$ to be the projection onto the $k$-dimensional subspace spanned by the top-$k$ eigenvectors of the data covariance matrix $\boldsymbol{\Sigma} = \frac{1}{N} \sum_i \mathbf{x}_i \mathbf{x}_i^\top$ (assuming zero-mean data). This is precisely the PCA solution, as we derived in Chapter 7.

This is a powerful sanity check: the linear autoencoder recovers PCA. Everything we build on top---nonlinear activations, convolutional layers, probabilistic objectives---extends beyond what PCA can do. Nonlinear autoencoders can capture curved manifolds, learn hierarchical features, and model complex dependencies that no linear method can represent.

16.1.5 Historical Context

The idea of learning compressed representations through neural networks dates back to the 1980s. Hinton and Salakhutdinov (2006) demonstrated that deep autoencoders with multiple hidden layers could learn far more powerful representations than PCA, producing lower reconstruction error with fewer latent dimensions on tasks like image compression and document retrieval. Their work was instrumental in the "deep learning renaissance" --- showing that deep networks, when properly trained, could learn representations that shallow methods could not match.

Earlier, Bourlard and Kamp (1988) had proven the equivalence between linear autoencoders and PCA, establishing the theoretical foundation that motivated the search for nonlinear extensions. The progression from linear PCA to nonlinear autoencoders mirrors a broader theme in machine learning: start with a simple, well-understood method and then generalize it using neural networks.

16.2 Undercomplete Autoencoders

16.2.1 The Bottleneck Constraint

The simplest way to prevent the identity mapping is to make the latent dimension $k$ smaller than the input dimension $d$. This is called an undercomplete autoencoder. The bottleneck forces the encoder to compress the input, discarding irrelevant information and retaining only the most salient features.

For an input of dimension $d = 784$ (a $28 \times 28$ MNIST image) and a latent dimension of $k = 32$, the autoencoder must compress 784 values into 32---a 24.5x compression ratio. If the reconstruction is good despite this compression, the 32-dimensional latent code must capture the essential structure of the data.

16.2.2 Architecture Design

A typical undercomplete autoencoder for MNIST uses fully connected layers with decreasing widths in the encoder and increasing widths in the decoder:

Encoder: 784 → 256 → 128 → 32 (latent)
Decoder: 32 → 128 → 256 → 784

The encoder progressively compresses information, and the decoder progressively reconstructs it. Activation functions (ReLU, LeakyReLU) introduce nonlinearity, allowing the autoencoder to learn curved manifolds rather than flat subspaces.

The final decoder layer uses an appropriate activation: - Sigmoid: When inputs are in $[0, 1]$ (e.g., normalized images), paired with BCE loss. - None (linear): When inputs are real-valued, paired with MSE loss. - Tanh: When inputs are in $[-1, 1]$.

16.2.3 Training and Evaluation

Training an autoencoder is straightforward: feed inputs through the encoder and decoder, compute the reconstruction loss, and backpropagate. There are no labels involved---this is unsupervised learning.

Evaluation, however, is subtle. A low reconstruction loss means the autoencoder can compress and decompress well, but it does not directly tell us whether the learned representations are useful. We evaluate representations through:

Visual inspection: Do reconstructed images look reasonable?
Downstream task performance: Train a simple classifier on the latent codes. If a linear classifier on 32-dimensional codes achieves high accuracy, the codes are informative.
Interpolation: Do linear interpolations between latent codes produce smooth, realistic transitions?
Clustering: Do similar inputs cluster together in latent space?

16.2.4 Limitations

Undercomplete autoencoders have a fundamental problem: if the encoder and decoder have high capacity (deep networks with many parameters), they can learn to memorize the training data even with a small bottleneck. The latent space may become fragmented and unstructured---nearby points in latent space may decode to wildly different outputs. This motivates the regularized variants we explore next.

16.3 Sparse Autoencoders

16.3.1 Sparsity as Regularization

Instead of (or in addition to) constraining the bottleneck size, we can regularize the activations of the latent layer. A sparse autoencoder encourages the latent code $\mathbf{z}$ to have mostly zero (or near-zero) entries, with only a few active units for any given input. This is analogous to L1 regularization on model weights (Chapter 13), but applied to activations.

The loss function becomes:

$$\mathcal{L} = \ell(\mathbf{x}, \hat{\mathbf{x}}) + \lambda \Omega(\mathbf{z})$$

where $\Omega(\mathbf{z})$ is the sparsity penalty and $\lambda > 0$ controls its strength.

16.3.2 L1 Sparsity

The simplest sparsity penalty is the L1 norm of the latent activations:

$$\Omega_{L1}(\mathbf{z}) = \|\mathbf{z}\|_1 = \sum_{j=1}^{k} |z_j|$$

This directly penalizes the magnitude of each latent dimension, pushing most values toward zero.

16.3.3 KL Divergence Sparsity

A more principled approach treats the average activation of each latent unit $j$ across the training batch as a Bernoulli probability $\hat{\rho}_j$, and penalizes its deviation from a target sparsity $\rho$ (e.g., $\rho = 0.05$):

$$\hat{\rho}_j = \frac{1}{N}\sum_{i=1}^{N} z_j^{(i)}$$

$$\Omega_{\text{KL}} = \sum_{j=1}^{k} \text{KL}(\rho \| \hat{\rho}_j) = \sum_{j=1}^{k} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right]$$

This penalty is zero when $\hat{\rho}_j = \rho$ for all $j$, and increases as the average activation deviates from the target. It requires sigmoid activations in the latent layer so that activations are in $[0, 1]$.

16.3.4 Overcomplete Sparse Autoencoders

An interesting consequence of sparsity regularization is that we can use a latent dimension larger than the input dimension ($k > d$). This is an overcomplete representation. Without sparsity, an overcomplete autoencoder trivially learns the identity. With sparsity, it learns a useful dictionary of features: each input activates only a small subset of the available latent units.

This connects to the theory of sparse coding and compressed sensing from signal processing: if the data has a sparse representation in some basis, we can recover it from fewer measurements than the ambient dimension would suggest.

16.3.5 PyTorch Implementation of a Sparse Autoencoder

Let us implement a sparse autoencoder with L1 sparsity. The key change from the basic autoencoder is adding the sparsity penalty to the loss function.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)


class SparseAutoencoder(nn.Module):
    """Autoencoder with L1 sparsity on latent activations.

    Args:
        input_dim: Dimensionality of the input.
        latent_dim: Dimensionality of the latent layer (can be > input_dim).
        sparsity_weight: Coefficient lambda for L1 penalty.
    """

    def __init__(
        self,
        input_dim: int = 784,
        latent_dim: int = 1024,
        sparsity_weight: float = 1e-3,
    ) -> None:
        super().__init__()
        self.sparsity_weight = sparsity_weight
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, input_dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        x_hat, z = self.forward(x)
        recon_loss = F.mse_loss(x_hat, x)
        sparsity_loss = self.sparsity_weight * z.abs().mean()
        return recon_loss + sparsity_loss


# --- Demo ---
model = SparseAutoencoder(input_dim=784, latent_dim=1024)
x = torch.randn(16, 784).sigmoid()
total_loss = model.loss(x)
print(f"Total loss: {total_loss.item():.4f}")

# Check sparsity of activations
_, z = model(x)
active_fraction = (z > 0.01).float().mean().item()
print(f"Fraction of active latent units: {active_fraction:.4f}")

Notice that the latent dimension (1024) is larger than the input dimension (784), making this an overcomplete autoencoder. Without the sparsity penalty, the network could trivially learn the identity mapping. The L1 term on activations forces most latent units to be inactive for any given input, encouraging the network to learn a diverse dictionary of features.

16.3.6 Sparse Autoencoders and Mechanistic Interpretability

A fascinating recent application of sparse autoencoders is in mechanistic interpretability of large language models. Researchers at Anthropic (Bricken et al., 2023) and others have used sparse autoencoders to decompose the internal activations of Transformer models into interpretable features. The key insight is that neural network activations are superposed---they represent multiple features simultaneously in overlapping directions. A sparse autoencoder with a large overcomplete basis can disentangle these superposed features into individual, interpretable units.

For example, training a sparse autoencoder on the residual stream activations of a language model might yield latent units that activate specifically for concepts like "DNA sequences," "French text," "mathematical proofs," or "sarcastic tone." This connects the classical idea of sparse coding to the cutting-edge challenge of understanding what large neural networks have learned.

16.3.7 When to Use Sparse Autoencoders

Sparse autoencoders are particularly useful when: - You want interpretable features (each latent unit corresponds to a specific feature). - The data has a naturally sparse structure (e.g., images where only a few edges or textures are present at any location). - You want an overcomplete representation for downstream tasks. - You need to analyze the internal representations of other neural networks (mechanistic interpretability).

16.4 Denoising Autoencoders

16.4.1 Corruption as Regularization

Denoising autoencoders (DAEs) take a different approach to preventing trivial solutions: they corrupt the input and train the network to reconstruct the clean original. The corruption process forces the encoder to learn robust features that capture the underlying data distribution rather than surface-level details.

Given a clean input $\mathbf{x}$, we first apply a corruption process $\tilde{\mathbf{x}} \sim q(\tilde{\mathbf{x}} | \mathbf{x})$ and then train the autoencoder to reconstruct $\mathbf{x}$ from $\tilde{\mathbf{x}}$:

$$\mathcal{L} = \ell(\mathbf{x}, g_\theta(f_\phi(\tilde{\mathbf{x}})))$$

Note that the loss compares the output to the clean input $\mathbf{x}$, not the corrupted input $\tilde{\mathbf{x}}$.

16.4.2 Common Corruption Strategies

Several corruption strategies are used in practice:

Additive Gaussian noise: $\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. The noise level $\sigma$ controls the difficulty of the denoising task.

Masking noise (dropout): Each input dimension is independently set to zero with probability $p$. This is equivalent to applying dropout to the input:

$$\tilde{x}_j = \begin{cases} x_j & \text{with probability } 1 - p \\ 0 & \text{with probability } p \end{cases}$$

Salt-and-pepper noise: Each dimension is independently replaced with 0 or 1 with probability $p/2$ each.

16.4.3 Theoretical Motivation

Vincent et al. (2010) showed that training a denoising autoencoder is equivalent to learning a vector field that points toward the data manifold from nearby points. Formally, in the limit of small noise, the DAE learns to estimate the score function:

$$\nabla_{\mathbf{x}} \log p(\mathbf{x})$$

This connection to score matching and the data distribution makes denoising autoencoders theoretically well-grounded. It also foreshadows the score-based diffusion models that have become state-of-the-art for generation (which use a closely related principle).

16.4.4 PyTorch Implementation of a Denoising Autoencoder

The implementation requires only a small modification to the training loop: we add noise to the input before encoding, but compute the loss against the clean input.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)


class DenoisingAutoencoder(nn.Module):
    """Denoising autoencoder with configurable corruption.

    Args:
        input_dim: Dimensionality of the input.
        latent_dim: Dimensionality of the latent bottleneck.
        noise_type: Type of corruption ('gaussian' or 'masking').
        noise_level: Standard deviation for Gaussian noise, or
            dropout probability for masking noise.
    """

    def __init__(
        self,
        input_dim: int = 784,
        latent_dim: int = 128,
        noise_type: str = "gaussian",
        noise_level: float = 0.3,
    ) -> None:
        super().__init__()
        self.noise_type = noise_type
        self.noise_level = noise_level

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid(),
        )

    def corrupt(self, x: torch.Tensor) -> torch.Tensor:
        """Apply corruption to the input."""
        if self.noise_type == "gaussian":
            noise = torch.randn_like(x) * self.noise_level
            return (x + noise).clamp(0, 1)
        elif self.noise_type == "masking":
            mask = torch.bernoulli(
                torch.full_like(x, 1.0 - self.noise_level)
            )
            return x * mask
        else:
            raise ValueError(f"Unknown noise type: {self.noise_type}")

    def forward(
        self, x: torch.Tensor, add_noise: bool = True
    ) -> torch.Tensor:
        if add_noise and self.training:
            x_corrupted = self.corrupt(x)
        else:
            x_corrupted = x
        z = self.encoder(x_corrupted)
        x_hat = self.decoder(z)
        return x_hat


# --- Demo ---
model = DenoisingAutoencoder(noise_type="masking", noise_level=0.5)
model.train()
x_clean = torch.randn(16, 784).sigmoid()
x_hat = model(x_clean, add_noise=True)
loss = F.mse_loss(x_hat, x_clean)  # Compare to CLEAN input
print(f"Denoising loss: {loss.item():.4f}")

The critical detail is on the loss line: we compare x_hat to x_clean, not to the corrupted input. This forces the network to learn the structure of clean data rather than memorizing the corruption pattern.

16.4.5 Connection to Score Matching and Diffusion Models

The theoretical connection between denoising autoencoders and score matching deserves deeper exploration, because it directly leads to the diffusion models we will study in Chapter 27.

The score function of a distribution $p(\mathbf{x})$ is the gradient of its log-density:

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

Intuitively, the score function is a vector field that points from any point in space toward regions of higher probability --- it tells you which direction to move to reach more likely data points.

Vincent (2011) proved that training a denoising autoencoder with Gaussian noise of standard deviation $\sigma$ is equivalent to learning the score function of the data distribution smoothed by a Gaussian kernel of width $\sigma$. Specifically, the optimal denoising function $r^*(\tilde{\mathbf{x}})$ satisfies:

$$r^*(\tilde{\mathbf{x}}) - \tilde{\mathbf{x}} \propto \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}})$$

where $p_\sigma$ is the data distribution convolved with $\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$.

This result is profound: the vector from a noisy point $\tilde{\mathbf{x}}$ to its denoised version $r^*(\tilde{\mathbf{x}})$ is proportional to the score of the smoothed data distribution. Diffusion models (Chapter 27) exploit this connection by training a neural network to denoise at many different noise levels, effectively learning the score function across multiple scales. Generation then proceeds by starting from pure noise and iteratively denoising --- following the learned score field back to the data manifold.

16.4.6 Stacked Denoising Autoencoders

In a pre-deep-learning era, stacked denoising autoencoders were used for layer-wise pretraining of deep networks. The idea is to train a sequence of denoising autoencoders, each operating on the latent codes of the previous one, and then fine-tune the entire stack for a supervised task. While this approach has been largely superseded by modern training techniques (Chapter 12), it was historically important and demonstrates that denoising autoencoders learn hierarchical features.

16.5 Contractive Autoencoders

16.5.1 Penalizing the Jacobian

A contractive autoencoder (CAE) regularizes the encoder by penalizing the sensitivity of the latent representation to small changes in the input. The intuition is that a good representation should be robust to small perturbations that do not change the identity of the data point. If slightly moving a face image along the pixel direction of "add a bright pixel in the corner" drastically changes the latent code, the encoder is capturing noise rather than structure.

Formally, the contractive penalty is the squared Frobenius norm of the encoder's Jacobian matrix:

$$\Omega_{\text{CAE}} = \left\| \frac{\partial f_\phi(\mathbf{x})}{\partial \mathbf{x}} \right\|_F^2 = \sum_{j=1}^{k} \left\| \frac{\partial z_j}{\partial \mathbf{x}} \right\|^2$$

where:

$f_\phi(\mathbf{x})$ is the encoder mapping
$z_j$ is the $j$-th component of the latent code
The Frobenius norm sums the squared magnitudes of all partial derivatives

The total loss is:

$$\mathcal{L}_{\text{CAE}} = \ell(\mathbf{x}, \hat{\mathbf{x}}) + \lambda \left\| \frac{\partial f_\phi(\mathbf{x})}{\partial \mathbf{x}} \right\|_F^2$$

Worked example. Consider a single-layer encoder with sigmoid activation: $\mathbf{z} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$. The Jacobian is:

$$\mathbf{J} = \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \text{diag}(\mathbf{z} \odot (1 - \mathbf{z})) \mathbf{W}$$

where $\mathbf{z} \odot (1 - \mathbf{z})$ is the element-wise derivative of the sigmoid. The Frobenius norm becomes:

$$\|\mathbf{J}\|_F^2 = \sum_{j=1}^{k} z_j^2 (1 - z_j)^2 \|\mathbf{w}_j\|^2$$

where $\mathbf{w}_j$ is the $j$-th row of $\mathbf{W}$. This penalty is small when the latent activations are saturated ($z_j \approx 0$ or $z_j \approx 1$, where the sigmoid is flat) or when the weight vectors are small.

16.5.2 Relationship to Denoising Autoencoders

Contractive autoencoders and denoising autoencoders are closely related. Rifai et al. (2011) showed that in the limit of small noise, the denoising autoencoder's reconstruction objective implicitly regularizes the Jacobian of the encoder. Specifically, for Gaussian noise of variance $\sigma^2$:

$$\mathbb{E}_{\tilde{\mathbf{x}} \sim q(\tilde{\mathbf{x}}|\mathbf{x})} [\|\hat{\mathbf{x}} - \mathbf{x}\|^2] \approx \|\hat{\mathbf{x}}_0 - \mathbf{x}\|^2 + \sigma^2 \left\| \frac{\partial g_\theta(f_\phi(\mathbf{x}))}{\partial \mathbf{x}} \right\|_F^2$$

where $\hat{\mathbf{x}}_0$ is the reconstruction from the clean input. The second term is a Jacobian penalty on the full autoencoder (encoder followed by decoder), which is related to but not identical to the contractive penalty on the encoder alone.

This connection reveals a unifying principle: all three regularized autoencoder variants (sparse, denoising, contractive) encourage the encoder to learn representations that are robust to irrelevant variations in the input. They differ in how they enforce this robustness, but the goal is the same.

16.6 Convolutional Autoencoders

16.5.1 Leveraging Spatial Structure

When working with images, fully connected autoencoders ignore the spatial structure that makes images special. Convolutional autoencoders use convolutional layers in the encoder and transposed convolutional (or upsampling + convolution) layers in the decoder, preserving spatial relationships and dramatically reducing parameter counts.

A convolutional encoder for $28 \times 28$ grayscale images might look like:

Conv2d(1, 32, 3, stride=2, padding=1)  → 32 × 14 × 14
Conv2d(32, 64, 3, stride=2, padding=1) → 64 × 7 × 7
Flatten → 3136
Linear(3136, latent_dim)

The decoder reverses this process:

Linear(latent_dim, 3136)
Reshape → 64 × 7 × 7
ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1) → 32 × 14 × 14
ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1)  → 1 × 28 × 28

16.5.2 Transposed Convolutions vs. Upsampling

There are two common strategies for the decoder:

Transposed convolutions (sometimes misleadingly called "deconvolutions"): These learnable upsampling layers increase spatial resolution. They can produce checkerboard artifacts due to uneven overlap patterns.
Nearest-neighbor upsampling + convolution: First upsample using nearest-neighbor interpolation, then apply a standard convolution. This avoids checkerboard artifacts and often produces smoother outputs.

Both approaches work well in practice, but the upsampling + convolution approach is generally preferred when visual quality matters.

16.7 Variational Autoencoders (VAEs)

16.7.1 From Autoencoders to Generative Models

The autoencoders we have seen so far are deterministic: a fixed input always maps to the same latent code. This makes them useful for compression and feature learning, but not for generation. If we want to sample new data points, we need a model that defines a probability distribution over the latent space.

A Variational Autoencoder (VAE) reimagines the autoencoder through the lens of probabilistic generative modeling. Instead of learning a deterministic mapping, the VAE learns a probability distribution in latent space and uses it to generate new data.

16.7.2 The Generative Model

The VAE defines a generative process:

Sample a latent code from a prior: $\mathbf{z} \sim p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$
Generate a data point from the latent code: $\mathbf{x} \sim p_\theta(\mathbf{x} | \mathbf{z})$

The decoder network parameterizes $p_\theta(\mathbf{x} | \mathbf{z})$. For binary data, this is a Bernoulli distribution whose parameters are the sigmoid outputs of the decoder. For continuous data, it is a Gaussian whose mean is the decoder output.

Our goal is to maximize the marginal likelihood (also called the evidence):

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}$$

This integral is intractable because it requires integrating over all possible latent codes. We cannot evaluate it, let alone optimize it directly.

16.7.3 The Variational Inference Framework

Since the true posterior $p_\theta(\mathbf{z} | \mathbf{x})$ is intractable, we introduce an approximate posterior $q_\phi(\mathbf{z} | \mathbf{x})$---the encoder network---that approximates it. The encoder outputs the parameters of a Gaussian distribution:

$$q_\phi(\mathbf{z} | \mathbf{x}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))$$

The encoder network takes input $\mathbf{x}$ and outputs two vectors: the mean $\boldsymbol{\mu}$ and the log-variance $\log \boldsymbol{\sigma}^2$. We parameterize the log-variance rather than the variance directly because it can take any real value, making optimization easier.

16.7.4 Deriving the Evidence Lower Bound (ELBO)

We now derive the ELBO, the objective function that makes VAE training possible. Start with the log-marginal likelihood:

$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}$$

Introduce the approximate posterior $q_\phi(\mathbf{z} | \mathbf{x})$ by multiplying and dividing inside the integral:

$$\log p_\theta(\mathbf{x}) = \log \int \frac{p_\theta(\mathbf{x} | \mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z} | \mathbf{x})} q_\phi(\mathbf{z} | \mathbf{x}) \, d\mathbf{z}$$

By Jensen's inequality ($\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$ for concave $\log$):

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x} | \mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z} | \mathbf{x})} \right]$$

Expanding and rearranging:

$$\log p_\theta(\mathbf{x}) \geq \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} [\log p_\theta(\mathbf{x} | \mathbf{z})]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z} | \mathbf{x}) \| p(\mathbf{z}))}_{\text{Regularization term}}$$

This is the Evidence Lower Bound (ELBO). Let us examine each term:

Reconstruction term: $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} [\log p_\theta(\mathbf{x} | \mathbf{z})]$ measures how well the decoder reconstructs the input from samples of the approximate posterior. This is the negative of the reconstruction loss.
KL divergence term: $D_{\text{KL}}(q_\phi(\mathbf{z} | \mathbf{x}) \| p(\mathbf{z}))$ measures how far the approximate posterior deviates from the prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$. It acts as a regularizer, preventing the encoder from placing all mass on a single point (which would reduce to a deterministic autoencoder).

We can also derive the ELBO without Jensen's inequality. The key identity is:

$$\log p_\theta(\mathbf{x}) = \text{ELBO}(\phi, \theta; \mathbf{x}) + D_{\text{KL}}(q_\phi(\mathbf{z} | \mathbf{x}) \| p_\theta(\mathbf{z} | \mathbf{x}))$$

Since $D_{\text{KL}} \geq 0$, the ELBO is always a lower bound on the log-evidence. Maximizing the ELBO simultaneously: (a) maximizes the log-evidence $\log p_\theta(\mathbf{x})$ (making the model fit the data), and (b) minimizes $D_{\text{KL}}(q_\phi \| p_\theta(\mathbf{z}|\mathbf{x}))$ (making the approximation tight).

16.7.5 Closed-Form KL Divergence

For the special case where both $q_\phi(\mathbf{z}|\mathbf{x})$ and $p(\mathbf{z})$ are Gaussian, the KL divergence has a convenient closed form. Let $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$D_{\text{KL}}(q \| p) = -\frac{1}{2}\sum_{j=1}^{k}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

This expression is differentiable and cheap to compute. No sampling is needed for the KL term---only the reconstruction term requires sampling.

16.7.6 The Reparameterization Trick

To compute gradients of the reconstruction term $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} [\log p_\theta(\mathbf{x} | \mathbf{z})]$ with respect to $\phi$, we need to backpropagate through the sampling operation $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$. But sampling is not differentiable.

The reparameterization trick solves this elegantly. Instead of sampling $\mathbf{z}$ directly from $q_\phi$, we express $\mathbf{z}$ as a deterministic, differentiable function of $\phi$ and an independent noise variable $\boldsymbol{\epsilon}$:

$$\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$ $$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}$$

where $\odot$ denotes element-wise multiplication. Now $\mathbf{z}$ is a differentiable function of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ (both outputs of the encoder), so gradients flow through $\mathbf{z}$ back to the encoder parameters $\phi$.

This trick transforms the problem from "differentiating through a stochastic node" to "differentiating through a deterministic computation with external randomness." It is one of the key ideas that made VAEs practical.

16.7.7 The Complete VAE Loss

Putting everything together, the VAE loss for a single data point $\mathbf{x}$ is:

$$\mathcal{L}_{\text{VAE}}(\phi, \theta; \mathbf{x}) = -\text{ELBO} = -\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} [\log p_\theta(\mathbf{x} | \mathbf{z})] + D_{\text{KL}}(q_\phi(\mathbf{z} | \mathbf{x}) \| p(\mathbf{z}))$$

In practice, the expectation is approximated with a single Monte Carlo sample ($L = 1$):

$$\mathcal{L}_{\text{VAE}} \approx -\log p_\theta(\mathbf{x} | \mathbf{z}^{(1)}) + D_{\text{KL}}(q_\phi(\mathbf{z} | \mathbf{x}) \| p(\mathbf{z}))$$

where $\mathbf{z}^{(1)} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}^{(1)}$ and $\boldsymbol{\epsilon}^{(1)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

For binary data (like binarized MNIST), $-\log p_\theta(\mathbf{x}|\mathbf{z})$ is the binary cross-entropy. For continuous data, it is proportional to the MSE.

16.7.8 VAE Training in Practice

Architecture: The encoder outputs $2k$ values (mean and log-variance for each of $k$ latent dimensions). The decoder takes a $k$-dimensional latent code and outputs the reconstruction.

KL annealing: A common trick is to gradually increase the weight of the KL term from 0 to 1 during training:

$$\mathcal{L} = -\mathbb{E}[\log p_\theta(\mathbf{x}|\mathbf{z})] + \beta \cdot D_{\text{KL}}(q_\phi \| p)$$

where $\beta$ increases linearly from 0 to 1 over a warm-up period. This prevents the KL term from dominating early in training, which can cause posterior collapse---a pathology where the encoder ignores the input and the posterior collapses to the prior.

$\beta$-VAE: Higgins et al. (2017) showed that setting $\beta > 1$ encourages disentangled representations, where each latent dimension controls a single factor of variation (e.g., one dimension for rotation, another for size). This comes at the cost of reconstruction quality, creating a fundamental tension: higher $\beta$ produces more disentangled but blurrier outputs.

The intuition behind $\beta$-VAE is information-theoretic. A large $\beta$ more aggressively pushes the approximate posterior toward the factorized prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$, which has independent dimensions. This pressure forces each latent dimension to capture a statistically independent factor of variation. If $z_1$ controls face orientation and $z_2$ controls hair color, changing $z_1$ should not affect $z_2$.

Worked example. Suppose we train a $\beta$-VAE on face images with $k = 10$ latent dimensions and $\beta = 4$. After training, we can traverse each latent dimension independently:

Varying $z_1$ from $-3$ to $+3$ while holding all others at 0 might produce faces that rotate from left-facing to right-facing.
Varying $z_2$ might change skin tone.
Varying $z_3$ might toggle glasses on/off.

With a standard VAE ($\beta = 1$), these factors tend to be entangled --- changing one latent dimension affects multiple visual attributes simultaneously. The quantitative measure of disentanglement is often computed using the $\beta$-VAE metric or the DCI (Disentanglement, Completeness, Informativeness) framework.

In practice, choosing $\beta$ requires tuning. Values of $\beta \in [2, 10]$ are common. Larger $\beta$ values produce more disentangled representations at the cost of reconstruction fidelity. The AnnealedVAE approach (Burgess et al., 2018) gradually increases the capacity of the latent channel, providing a more principled alternative to manual $\beta$ tuning.

16.7.9 Complete VAE Implementation in PyTorch

Let us implement a complete VAE for MNIST to solidify the concepts:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)


class VAE(nn.Module):
    """Variational Autoencoder for MNIST.

    Args:
        input_dim: Input dimensionality (784 for MNIST).
        hidden_dim: Hidden layer size.
        latent_dim: Dimensionality of the latent space.
    """

    def __init__(
        self,
        input_dim: int = 784,
        hidden_dim: int = 400,
        latent_dim: int = 20,
    ) -> None:
        super().__init__()
        # Encoder: x -> hidden -> (mu, log_var)
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: z -> hidden -> x_hat
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, input_dim)

    def encode(
        self, x: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        h = F.relu(self.fc1(x))
        mu = self.fc_mu(h)
        log_var = self.fc_logvar(h)
        return mu, log_var

    def reparameterize(
        self, mu: torch.Tensor, log_var: torch.Tensor
    ) -> torch.Tensor:
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h))

    def forward(
        self, x: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        x_hat = self.decode(z)
        return x_hat, mu, log_var


def vae_loss(
    x: torch.Tensor,
    x_hat: torch.Tensor,
    mu: torch.Tensor,
    log_var: torch.Tensor,
) -> torch.Tensor:
    """Compute VAE loss = reconstruction + KL divergence."""
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl


# --- Demo ---
model = VAE(latent_dim=20)
x = torch.randn(32, 784).sigmoid()  # Simulated batch
x_hat, mu, log_var = model(x)
loss = vae_loss(x, x_hat, mu, log_var)
print(f"VAE loss: {loss.item():.1f}")
print(f"Mean of mu: {mu.mean().item():.4f}")
print(f"Mean of exp(log_var): {log_var.exp().mean().item():.4f}")

Note the three key components: (1) the encoder outputs both $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}^2$, (2) the reparameterize method implements the reparameterization trick, and (3) the loss function combines binary cross-entropy reconstruction with the closed-form KL divergence.

16.7.10 Latent Space Visualization

One of the most compelling aspects of VAEs is their structured latent space. Because the KL term regularizes the posterior toward the standard Gaussian prior, the latent space is smooth and continuous.

For a 2D latent space, we can visualize the manifold by:

Encoding and plotting: Encode all training points and plot them in 2D, colored by class. You should see smooth clusters with gradual transitions between classes.
Grid decoding: Create a grid of points in latent space (e.g., from $-3$ to $3$ in both dimensions), decode each point, and display the resulting images. This reveals the generative manifold.
Interpolation: Encode two data points, linearly interpolate between their latent codes, and decode each intermediate point. Smooth transitions indicate a well-structured latent space.

Spherical linear interpolation (slerp). For higher-dimensional latent spaces, linear interpolation can pass through low-density regions of the prior (the center of a high-dimensional Gaussian has low probability). Spherical linear interpolation (slerp) follows the surface of the hypersphere, staying in higher-density regions:

$$\text{slerp}(\mathbf{z}_1, \mathbf{z}_2, t) = \frac{\sin((1-t)\theta)}{\sin(\theta)} \mathbf{z}_1 + \frac{\sin(t\theta)}{\sin(\theta)} \mathbf{z}_2$$

where $\theta = \arccos(\hat{\mathbf{z}}_1^\top \hat{\mathbf{z}}_2)$ is the angle between the normalized vectors. This typically produces smoother interpolations than linear interpolation, especially for latent spaces with more than 2 dimensions.

16.7.11 Posterior Collapse

Posterior collapse occurs when the approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$ matches the prior $p(\mathbf{z})$ for all inputs---the encoder ignores the input entirely. The KL term becomes zero, and the decoder learns to generate outputs from pure noise, often producing blurry averages.

This happens when the decoder is too powerful (e.g., an autoregressive decoder that can model the data without using the latent code) or when the KL term overwhelms the reconstruction term early in training.

Mitigations include: - KL annealing: Gradually increase the KL weight. - Free bits: Set a minimum KL budget per dimension (Kingma et al., 2016). - Decoder capacity limitation: Use a less powerful decoder to force reliance on the latent code. - Aggressive encoder training: Update the encoder more frequently than the decoder.

16.8 Beyond Vanilla VAEs

16.8.1 Conditional VAEs

A Conditional VAE (CVAE) conditions both the encoder and decoder on additional information $\mathbf{c}$ (e.g., a class label):

$$q_\phi(\mathbf{z} | \mathbf{x}, \mathbf{c}), \quad p_\theta(\mathbf{x} | \mathbf{z}, \mathbf{c})$$

This allows controlled generation: "Generate a digit that looks like a 7" or "Generate a face with glasses."

16.8.2 VQ-VAE: Vector Quantized Variational Autoencoders

The Vector Quantized VAE (van den Oord et al., 2017) takes a fundamentally different approach to the latent space: instead of a continuous Gaussian distribution, VQ-VAE uses a discrete codebook of learned embedding vectors. This design choice avoids the blurriness inherent in continuous VAEs and enables high-quality generation.

How VQ-VAE works. The architecture has three components:

Encoder $f_\phi$: Maps the input $\mathbf{x}$ to a continuous representation $\mathbf{z}_e = f_\phi(\mathbf{x})$.
Codebook $\mathbf{E} = \{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_K\}$: A collection of $K$ learned embedding vectors, each of dimension $d_z$. Typical values are $K = 512$ or $K = 8192$.
Quantization: The encoder output is mapped to the nearest codebook vector:

$$\mathbf{z}_q = \mathbf{e}_k, \quad \text{where} \quad k = \arg\min_j \|\mathbf{z}_e - \mathbf{e}_j\|_2$$

Decoder $g_\theta$: Reconstructs the input from the quantized representation: $\hat{\mathbf{x}} = g_\theta(\mathbf{z}_q)$.

The straight-through estimator. The $\arg\min$ operation is not differentiable, so gradients cannot flow from the decoder back through the quantization step to the encoder. VQ-VAE uses the straight-through estimator: during the forward pass, $\mathbf{z}_q$ is the quantized vector; during the backward pass, gradients are copied directly from $\mathbf{z}_q$ to $\mathbf{z}_e$, bypassing the quantization. In PyTorch:

# Straight-through estimator
z_q = z_e + (z_q_detached - z_e).detach()

The VQ-VAE loss has three terms:

$$\mathcal{L}_{\text{VQ-VAE}} = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|^2}_{\text{reconstruction}} + \underbrace{\|\text{sg}[\mathbf{z}_e] - \mathbf{e}_k\|^2}_{\text{codebook loss}} + \beta \underbrace{\|\mathbf{z}_e - \text{sg}[\mathbf{e}_k]\|^2}_{\text{commitment loss}}$$

where $\text{sg}[\cdot]$ denotes the stop-gradient operator. The codebook loss updates the codebook vectors toward the encoder outputs. The commitment loss prevents the encoder outputs from growing arbitrarily far from the codebook vectors (with $\beta = 0.25$ as a typical value).

Why VQ-VAE produces sharper images. Continuous VAEs must model a smooth, unimodal distribution in latent space (typically Gaussian), which leads to averaging over multiple modes and hence blurriness. VQ-VAE sidesteps this by using discrete codes --- each codebook entry can represent a sharp, specific pattern. Generation with VQ-VAE is typically a two-stage process: first, train the VQ-VAE; then, train an autoregressive model (like a PixelCNN or Transformer) over the discrete latent codes to model their distribution.

VQ-VAE-2 (Razavi et al., 2019) extends this to hierarchical discrete latent codes at multiple resolutions, achieving image quality competitive with the best GANs of its era. VQ-VAE was also the backbone of DALL-E 1 (Ramanan et al., 2021), which generated images from text descriptions using a Transformer trained on VQ-VAE codes.

16.8.3 Connection to Modern Latent Diffusion Models

The autoencoder family connects directly to the state-of-the-art in generative modeling: latent diffusion models (Rombach et al., 2022), which power systems like Stable Diffusion. The architecture is elegant in its simplicity:

Train a powerful autoencoder (typically a VQ-VAE or a regularized autoencoder with a KL penalty) that compresses images from pixel space (e.g., $512 \times 512 \times 3$) to a much smaller latent space (e.g., $64 \times 64 \times 4$).
Train a diffusion model in the compressed latent space rather than in pixel space.

This two-stage approach combines the best of both worlds: the autoencoder handles the perceptually redundant compression of pixel data, while the diffusion model focuses on the semantic structure of the latent representations. Training diffusion models in latent space is dramatically cheaper than in pixel space because the latent representations are 48x to 64x smaller.

The autoencoder used in Stable Diffusion is trained with a combination of reconstruction loss, a small KL penalty (to keep the latent space smooth), and a perceptual loss (comparing features extracted by a VGG network rather than raw pixels). The perceptual loss is critical: it ensures that the reconstruction preserves the visual structure of images even if individual pixel values differ, avoiding the blurriness that plagues MSE-only objectives.

This connection underscores why autoencoders remain central to modern AI: they provide the learned compression that makes computationally expensive generative models practical. As we will explore in Chapter 27, the quality of the autoencoder's latent space directly determines the quality of the final generated images.

16.8.4 Representation Learning Theory: What Makes a Good Representation?

Before moving to contrastive learning, it is worth stepping back to ask a fundamental question: what properties should a learned representation have? The answer depends on the downstream use case, but several principles have emerged:

Sufficiency. A representation $\mathbf{z}$ is sufficient for a task if all task-relevant information in $\mathbf{x}$ is preserved in $\mathbf{z}$. Formally, $\mathbf{x}$ and $\mathbf{z}$ should have the same mutual information with the target variable $Y$: $I(\mathbf{z}; Y) = I(\mathbf{x}; Y)$.

Minimality. Among all sufficient representations, we prefer the most compressed one --- the one with the least mutual information with the input: $\min I(\mathbf{z}; \mathbf{x})$ subject to $I(\mathbf{z}; Y) = I(\mathbf{x}; Y)$. This is the Information Bottleneck principle (Tishby et al., 2000), which provides a formal framework for the compression-relevance tradeoff.

Disentanglement. Each latent dimension should correspond to a single, interpretable factor of variation. Formally, the latent dimensions should be statistically independent: $p(\mathbf{z}) = \prod_j p(z_j)$. This is the goal of $\beta$-VAE and its variants.

Smoothness. Small changes in latent space should produce small changes in data space. This is encouraged by the KL penalty in VAEs and the Jacobian penalty in contractive autoencoders.

Invariance and equivariance. A representation should be invariant to nuisance factors (lighting, background) while being equivariant to task-relevant factors (pose, identity). Contrastive learning methods, which we explore next, explicitly encourage invariance to augmentations while preserving other information.

16.8.5 VAE Variants Summary

Variant	Key Idea	Advantage
Vanilla VAE	Gaussian latent, ELBO objective	Simple, principled generation
$\beta$-VAE	$\beta > 1$ on KL term	Disentangled representations
CVAE	Condition on labels	Controlled generation
VQ-VAE	Discrete codebook	Sharper reconstructions
IWAE	Importance-weighted ELBO	Tighter bound

16.9 Contrastive Learning

16.9.1 The Self-Supervised Learning Paradigm

Everything we have discussed so far uses reconstruction as the training signal. But reconstruction is not the only way to learn useful representations without labels. Self-supervised learning creates supervisory signals from the data itself, typically by defining a pretext task whose labels are free.

The insight is powerful: instead of asking "Can you reconstruct this image?", we ask "Can you tell which images are similar and which are different?" This is the essence of contrastive learning.

16.9.2 The Contrastive Learning Framework

Contrastive learning learns representations by:

Creating positive pairs: Two augmented views of the same data point.
Creating negative pairs: Views from different data points.
Training the encoder to produce similar representations for positive pairs and dissimilar representations for negative pairs.

The quality of the augmentations is crucial. For images, typical augmentations include random cropping, color jittering, Gaussian blur, horizontal flipping, and grayscale conversion. The augmentations must be strong enough to prevent trivial shortcuts (e.g., matching based on color statistics) but not so strong that the two views become unrecognizable as the same image.

16.9.3 SimCLR: A Simple Framework for Contrastive Learning

SimCLR (Chen et al., 2020) is an elegantly simple contrastive learning framework. Given a batch of $N$ images:

Augmentation: Each image $\mathbf{x}_i$ is augmented twice to produce $\tilde{\mathbf{x}}_i$ and $\tilde{\mathbf{x}}_i'$. This creates $2N$ augmented images.
Encoding: A base encoder $f$ (e.g., ResNet) maps each augmented image to a representation: $\mathbf{h}_i = f(\tilde{\mathbf{x}}_i)$.
Projection: A small MLP projection head $g$ maps representations to a space where the contrastive loss is applied: $\mathbf{z}_i = g(\mathbf{h}_i)$.
Contrastive loss (NT-Xent): For a positive pair $(i, j)$, the Normalized Temperature-scaled Cross-Entropy loss is:

$$\ell_{i,j} = -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j) / \tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k) / \tau)}$$

where $\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^\top \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$ is cosine similarity and $\tau$ is a temperature parameter.

The denominator sums over all $2N - 1$ other examples in the batch (both augmented views of all other images plus the other augmented view of the same image). This makes the batch size critical: larger batches provide more negative pairs, improving performance.

Key design decisions: - The projection head $g$ is discarded after pretraining; only $f$ is used for downstream tasks. This is because $g$ loses information useful for downstream tasks to optimize the contrastive objective. - Temperature $\tau$ controls the concentration of the distribution. Lower $\tau$ sharpens the distribution, making the model more sensitive to hard negatives. - Larger batch sizes (up to 8,192 in the original paper) significantly improve performance.

16.9.4 BYOL: Bootstrap Your Own Latent

BYOL (Grill et al., 2020) showed that contrastive learning does not actually need negative pairs. BYOL uses two networks:

Online network: Encoder $f_\theta$, projector $g_\theta$, and predictor $q_\theta$.
Target network: Encoder $f_\xi$ and projector $g_\xi$ (no predictor). The target parameters $\xi$ are an exponential moving average of the online parameters: $\xi \leftarrow m\xi + (1 - m)\theta$.

Given two augmented views of the same image, the online network predicts the target network's representation:

$$\mathcal{L}_{\text{BYOL}} = \left\| \bar{q}_\theta(\mathbf{z}_\theta) - \bar{\mathbf{z}}_\xi' \right\|^2$$

where $\bar{\cdot}$ denotes L2 normalization and $\mathbf{z}_\xi'$ is the target network's output (treated as a constant---no gradient).

The absence of negative pairs raises a natural question: why doesn't BYOL collapse to a trivial solution where both networks output a constant? The answer involves a subtle interplay between the predictor, the asymmetric architecture, and the moving average update. The predictor $q_\theta$ and the stop-gradient on the target network together prevent collapse.

16.9.5 Evaluating Learned Representations

Contrastive learning methods are evaluated by linear probing: freeze the pretrained encoder $f$, add a single linear layer, and train it on the downstream task with labels. High linear probe accuracy indicates that the encoder has learned features that are linearly separable---a strong indicator of representation quality.

Other evaluation protocols include: - Fine-tuning: Unfreeze the encoder and train the entire model. This measures the quality of the initialization. - k-NN evaluation: Use the encoder's representations directly with a k-nearest-neighbors classifier. No training required. - Transfer learning: Evaluate on a different dataset than the one used for pretraining.

16.10 The Self-Supervised Learning Landscape

16.10.1 Pretext Tasks Beyond Contrastive Learning

Contrastive learning is one family of self-supervised methods. Others include:

Masked Image Modeling (MAE): Mask random patches of an image and train the model to reconstruct them. This is the visual analog of BERT's masked language modeling (Chapter 20).

Rotation prediction: Rotate an image by $0°$, $90°$, $180°$, or $270°$ and train the model to predict the rotation angle. To solve this, the model must understand the content of the image.

Jigsaw puzzles: Divide an image into patches, shuffle them, and train the model to predict the correct arrangement.

Colorization: Convert a color image to grayscale and train the model to predict the colors.

Each pretext task encodes a different inductive bias about what constitutes useful visual knowledge.

16.10.2 From Representations to Foundation Models

Self-supervised learning has evolved from a technique for learning visual representations to the dominant paradigm for training foundation models---large models pretrained on vast unlabeled datasets that are then adapted to many downstream tasks.

In NLP, the self-supervised paradigm (predicting masked or next tokens) gave us BERT and GPT (Chapters 20--21). In computer vision, methods like DINO, MAE, and DINOv2 have shown that self-supervised visual pretraining can match or exceed supervised pretraining. The pattern is clear: learn general representations first, then specialize.

16.10.3 Comparison of Representation Learning Methods

Method	Training Signal	Key Strength	Key Weakness
Undercomplete AE	Reconstruction	Simple, deterministic	Unstructured latent space
Sparse AE	Reconstruction + sparsity	Interpretable features	Extra hyperparameter ($\lambda$)
Denoising AE	Reconstruction from corrupted input	Robust features, score learning	Corruption design choices
VAE	ELBO (reconstruction + KL)	Principled generation, smooth latent space	Blurry reconstructions
SimCLR	Contrastive (positive/negative pairs)	Excellent downstream performance	Requires large batch sizes
BYOL	Self-prediction (no negatives)	No need for negative pairs	Complex training dynamics
MAE	Masked reconstruction	Scalable, strong representations	Requires Vision Transformer backbone

16.11 Practical Considerations

16.11.1 Choosing the Right Method

The choice of representation learning method depends on your goal:

Dimensionality reduction / feature extraction: Start with an undercomplete autoencoder. If the latent space is poor, try a denoising autoencoder.
Anomaly detection: Train an autoencoder on normal data and use reconstruction error as an anomaly score. Denoising autoencoders are particularly effective here.
Generative modeling: Use a VAE if you need a probabilistic model with a structured latent space. For higher-quality generation, consider VQ-VAE or diffusion models (Chapter 18).
Pretraining for downstream tasks: Use contrastive learning (SimCLR) or masked modeling (MAE) when you have abundant unlabeled data and want to learn transferable representations.

16.11.2 Latent Dimension Selection

The latent dimension $k$ controls the trade-off between compression and reconstruction quality:

Too small: The bottleneck is too tight; the model loses important information.
Too large: The bottleneck is too loose; the model may not learn useful compression.

Rules of thumb: - Start with $k$ that gives a 10x--50x compression ratio. - For VAEs, 2D latent spaces are excellent for visualization but too small for good reconstruction. Use 32--256 dimensions for practical applications. - Use reconstruction error on a validation set to tune $k$.

16.11.3 Common Pitfalls

Overfitting: Autoencoders can memorize the training data, especially with high capacity. Use the same regularization techniques from Chapter 13 (dropout, weight decay, early stopping).
Mode collapse in VAEs: If all encoded points map to the same region in latent space, the model has collapsed. Check the KL divergence: if it is near zero, posterior collapse has occurred.
Blurry VAE reconstructions: The MSE loss and the Gaussian decoder assumption tend to produce blurry outputs because the model hedges its bets. Using a perceptual loss or adversarial training can help.
Ignoring the projection head in contrastive learning: Remember to discard the projection head after pretraining. Using $g(f(\mathbf{x}))$ instead of $f(\mathbf{x})$ for downstream tasks typically hurts performance.

16.11.4 Debugging Autoencoders

When your autoencoder is not working, here is a systematic debugging checklist:

Check the loss function matches the output activation. If the decoder output uses Sigmoid (range $[0, 1]$), use BCE loss. If it uses Tanh (range $[-1, 1]$) or no activation, use MSE loss. Mismatching these leads to poor training.
Verify input normalization. If images are in $[0, 255]$ but the decoder output is in $[0, 1]$, the MSE loss will be enormous and training will focus on matching the scale rather than the content. Always normalize inputs to match the decoder's output range.
Start with a simple architecture. If a deep convolutional autoencoder fails, try a simple fully connected one first. If that works, add complexity gradually.
For VAEs, check the KL term magnitude. If the KL term is orders of magnitude larger than the reconstruction term, the model will collapse to the prior. If it is orders of magnitude smaller, the model is acting as a deterministic autoencoder. The two terms should be roughly balanced.
Visualize reconstructions early and often. Plotting original and reconstructed images side by side after every few epochs reveals problems (blurriness, color shifts, artifacts) much faster than monitoring loss curves alone.

16.11.5 Computational Requirements

Method	Typical Training Time	GPU Memory	Key Bottleneck
Autoencoder (FC)	Minutes	Low	None
Conv Autoencoder	10--30 min	Moderate	None
VAE	10--30 min	Moderate	KL balancing
SimCLR	Hours to days	High (large batches)	Batch size
BYOL	Hours to days	High	EMA updates

16.12 Summary

This chapter traced the evolution of unsupervised representation learning from simple autoencoders to modern self-supervised methods:

Undercomplete autoencoders learn compressed representations by forcing data through a bottleneck. They generalize PCA to nonlinear settings.
Sparse autoencoders regularize activations rather than architecture, enabling overcomplete representations with interpretable features.
Denoising autoencoders learn robust representations by training on corrupted inputs, with deep theoretical connections to score matching.
Variational autoencoders place representation learning on probabilistic foundations. The ELBO provides a principled training objective, the reparameterization trick enables gradient-based optimization, and the KL regularizer produces smooth, structured latent spaces suitable for generation.
Contrastive learning (SimCLR, BYOL) abandons reconstruction entirely, instead learning representations by distinguishing similar and dissimilar data points. These methods produce representations that rival or surpass supervised pretraining.
Self-supervised learning is the unifying paradigm: create free supervision from the data itself, learn general representations, then adapt to specific tasks.

The representations learned by these methods are not just mathematical curiosities---they are the foundation upon which modern AI systems are built. Transfer learning, few-shot learning, and foundation models all rely on the idea that good representations, learned from large unlabeled datasets, transfer to new tasks with minimal labeled data.

In the next chapter, we explore Generative Adversarial Networks, which take a fundamentally different approach to generation: instead of maximizing a likelihood, they pit a generator against a discriminator in a game-theoretic framework.

References

Hinton, G. E. and Salakhutdinov, R. R. (2006). "Reducing the Dimensionality of Data with Neural Networks." Science, 313(5786), 504--507.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion." JMLR, 11, 3371--3408.
Kingma, D. P. and Welling, M. (2014). "Auto-Encoding Variational Bayes." ICLR 2014.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). "Stochastic Backpropagation and Approximate Inference in Deep Generative Models." ICML 2014.
Higgins, I., Matthey, L., Pal, A., et al. (2017). "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework." ICLR 2017.
van den Oord, A., Vinyals, O., and Kavukcuoglu, K. (2017). "Neural Discrete Representation Learning." NeurIPS 2017.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). "A Simple Framework for Contrastive Learning of Visual Representations." ICML 2020.
Grill, J.-B., Strub, F., Altche, F., et al. (2020). "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning." NeurIPS 2020.
He, K., Chen, X., Xie, S., Li, Y., Dollar, P., and Girshick, R. (2022). "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022.