Chapter 16: Key Takeaways
Core Concepts
- Autoencoders learn compressed representations by training a network to reconstruct its own input through a bottleneck. The encoder maps the input to a low-dimensional latent code; the decoder reconstructs the input from that code. The bottleneck forces the network to discover the most important factors of variation in the data (see the autoencoder sketch after this list).
- A linear autoencoder with MSE loss recovers the same subspace as PCA. This serves as both a theoretical connection and a sanity check: nonlinear autoencoders extend PCA to curved manifolds, and every autoencoder should at least match PCA performance (the PCA comparison sketch after this list illustrates the check).
- Representation learning is the foundation of modern deep learning. The goal is not just compression, but learning features that transfer to downstream tasks. Good representations make classification, clustering, and generation easier.
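A minimal sketch of the encoder-bottleneck-decoder structure, assuming PyTorch; the layer widths are illustrative, and the dimensions follow the MNIST example in the next section ($d = 784$, $k = 32$).

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: input -> bottleneck (the low-dimensional latent code)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: bottleneck -> reconstruction of the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # compressed code
        return self.decoder(z)   # reconstruction

# Reconstruction objective: make the output match the input.
model = Autoencoder()
x = torch.rand(64, 784)          # stand-in batch of flattened images
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
```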
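And the PCA comparison as a sketch, assuming scikit-learn and PyTorch: a purely linear autoencoder trained with MSE on centered data should approach the reconstruction error of PCA with the same $k$. The synthetic data and training budget here are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Low-rank synthetic data so PCA with k components leaves a nonzero residual.
X = rng.normal(size=(2000, 20)).astype(np.float32) @ rng.normal(size=(20, 50)).astype(np.float32)
X -= X.mean(axis=0)                           # center, as PCA does internally
k = 5

# PCA baseline: reconstruct from the top-k principal components.
pca = PCA(n_components=k).fit(X)
pca_err = np.mean((X - pca.inverse_transform(pca.transform(X))) ** 2)

# Linear autoencoder: no nonlinearities, plain MSE reconstruction loss.
enc, dec = nn.Linear(50, k, bias=False), nn.Linear(k, 50, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
Xt = torch.from_numpy(X)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(dec(enc(Xt)), Xt)
    loss.backward()
    opt.step()

# With enough training steps the two errors should be close.
print(f"PCA: {pca_err:.4f}  linear AE: {loss.item():.4f}")
```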
Autoencoder Variants
- Undercomplete autoencoders constrain capacity through architecture ($k < d$). The bottleneck dimension $k$ controls the compression ratio. Too small and you lose information; too large and the model may learn the identity. For MNIST ($d = 784$), $k = 32$ is a reasonable starting point.
- Sparse autoencoders constrain capacity through activation regularization. L1 penalties or KL-divergence sparsity penalties encourage most latent units to be inactive for any given input. This enables overcomplete representations ($k > d$) where each input activates a different sparse subset of features (see the L1 penalty sketch after this list).
- Denoising autoencoders constrain capacity through input corruption. By training to reconstruct clean inputs from corrupted versions, DAEs learn robust features that capture the data distribution rather than surface-level details. In the limit of small noise, DAEs learn the score function $\nabla_\mathbf{x} \log p(\mathbf{x})$ (a denoising objective sketch follows this list).
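A sketch of the L1 activation penalty, assuming PyTorch; the overcomplete sizes (784 to 1024) and the penalty weight are illustrative.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 1024), nn.ReLU())   # k > d: overcomplete
decoder = nn.Linear(1024, 784)

x = torch.rand(64, 784)
z = encoder(x)                               # latent activations
recon = decoder(z)

recon_loss = nn.functional.mse_loss(recon, x)
l1_penalty = z.abs().mean()                  # pushes most units toward zero for each input
loss = recon_loss + 1e-3 * l1_penalty        # 1e-3 is an illustrative weight
loss.backward()
```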
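And the denoising objective as a sketch, assuming PyTorch and any encoder-decoder model such as the Autoencoder class sketched after the Core Concepts list; the Gaussian noise level is illustrative.

```python
import torch
import torch.nn as nn

def denoising_loss(model, x_clean, noise_std=0.3):
    # Corrupt the input, but score the reconstruction against the CLEAN original.
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)
    recon = model(x_noisy)
    return nn.functional.mse_loss(recon, x_clean)
```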
Variational Autoencoders
- VAEs are probabilistic generative models, not just autoencoders. The encoder outputs the parameters of a distribution (mean and variance), not a deterministic code. This enables principled generation by sampling from the latent space (see the VAE sketch after this list).
- The ELBO is the key to training VAEs. The Evidence Lower Bound is a reconstruction term minus a KL divergence: $\text{ELBO} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$. Maximizing the ELBO simultaneously fits the data and keeps the approximate posterior close to the prior. The gap between the ELBO and the true log-evidence equals $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x}))$ (the Gaussian case is implemented in the ELBO loss sketch after this list).
- The reparameterization trick makes VAE training possible. Writing $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ turns the sample into a deterministic function of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$, with the stochasticity isolated in an exogenous noise input, allowing gradients to flow through the sampling step (see the `reparameterize` method in the VAE sketch after this list).
- Posterior collapse is the main failure mode of VAEs. When the encoder ignores the input and the posterior matches the prior, the KL term drops to zero and the model degenerates. KL annealing (gradually increasing the KL weight $\beta$ from 0 to 1) is the standard mitigation (see the annealing sketch after this list).
- VAE latent spaces are smooth and continuous. The KL regularization produces a latent space where interpolation is meaningful, nearby points decode to similar outputs, and sampling from the prior generates plausible data. This contrasts with deterministic autoencoders, whose latent spaces can be fragmented and unstructured (the interpolation sketch after this list shows one way to inspect this).
- $\beta$-VAE trades reconstruction quality for disentanglement. Setting $\beta > 1$ increases pressure toward a factorized posterior, encouraging each latent dimension to capture a single factor of variation. The optimal $\beta$ depends on the application.
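A minimal VAE sketch, assuming PyTorch: the encoder head outputs $(\boldsymbol{\mu}, \log\boldsymbol{\sigma}^2)$, sampling uses the reparameterization trick, and the sigmoid decoder assumes pixel values in $[0, 1]$. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)        # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)    # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),   # outputs in [0, 1]
        )

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)                    # exogenous noise
        return mu + std * eps                          # gradients flow through mu and std

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
```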
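The ELBO as a training loss (negated, since optimizers minimize), using the closed-form KL between a diagonal Gaussian posterior and the standard normal prior. This assumes the VAE sketch above and a binary cross-entropy reconstruction term; setting $\beta > 1$ gives the $\beta$-VAE objective.

```python
import torch
import torch.nn as nn

def negative_elbo(recon, x, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder explains the data (x in [0, 1]).
    recon_loss = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl   # beta = 1 is the standard ELBO; beta > 1 is beta-VAE
```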
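A KL annealing schedule sketch; the warm-up length is an illustrative hyperparameter.

```python
def kl_weight(step, warmup_steps=10_000):
    # Ramp the KL weight linearly from 0 to 1 so the decoder cannot ignore z early on.
    return min(1.0, step / warmup_steps)

# Usage with the negative_elbo sketch above:
#   loss = negative_elbo(recon, x, mu, logvar, beta=kl_weight(global_step))
```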
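And a latent interpolation sketch, assuming the VAE class above: encode two inputs, walk a straight line between their posterior means, and decode each intermediate point. With a trained model the decoded sequence should morph smoothly between the two inputs.

```python
import torch

@torch.no_grad()
def interpolate(vae, x_a, x_b, steps=8):
    mu_a = vae.to_mu(vae.backbone(x_a))            # (1, latent_dim)
    mu_b = vae.to_mu(vae.backbone(x_b))
    alphas = torch.linspace(0, 1, steps).unsqueeze(1)
    z = (1 - alphas) * mu_a + alphas * mu_b        # straight line in latent space
    return vae.decoder(z)                          # one decoded image per step
```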
Contrastive and Self-Supervised Learning
- Contrastive learning learns representations by distinguishing similar from dissimilar pairs. SimCLR creates positive pairs via data augmentation and uses the NT-Xent loss to pull augmented views of the same image together while pushing different images apart. Large batch sizes provide more negatives and improve performance (see the NT-Xent sketch after this list).
- BYOL demonstrates that negative pairs are not strictly necessary. An asymmetric architecture (an online network with a predictor, and a target network updated by an exponential moving average) prevents collapse without explicit negatives, challenging the assumption that contrastive learning requires negative examples (the EMA update sketch after this list shows the target update).
- The data augmentation strategy defines the invariances learned. The choice of augmentations is the most important design decision in contrastive learning. Strong, diverse augmentations (crop, flip, color jitter, blur) produce the best representations (an example pipeline follows this list).
- Discard the projection head for downstream tasks. The projection head optimizes the contrastive objective at the expense of downstream-relevant information. Always use the encoder's representation (before the projection head) for transfer.
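A sketch of the NT-Xent loss, assuming PyTorch; `z1` and `z2` are the projection-head outputs for two augmented views of the same batch, so row $i$ of each forms a positive pair.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over a batch: z1[i] and z2[i] are views of the same image."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit-norm rows
    sim = z @ z.t() / temperature                        # cosine similarity logits
    n2 = sim.shape[0]
    # A view is never contrasted with itself.
    sim = sim.masked_fill(torch.eye(n2, dtype=torch.bool, device=sim.device), float("-inf"))
    # The positive for row i is row i + N (and vice versa); all other rows are negatives.
    n = n2 // 2
    targets = torch.cat([torch.arange(n, n2, device=sim.device),
                         torch.arange(0, n, device=sim.device)])
    return F.cross_entropy(sim, targets)
```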
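A sketch of the BYOL-style EMA target update, assuming PyTorch; the momentum value 0.996 is illustrative.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, tau: float = 0.996):
    """Move each target parameter toward its online counterpart: t <- tau*t + (1-tau)*o."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(tau).add_((1 - tau) * p_online)

# Typical setup: the target starts as a gradient-free copy of the online network,
#   target = copy.deepcopy(online).requires_grad_(False)
# and ema_update(online, target) is called after every optimizer step.
```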
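And a SimCLR-style augmentation pipeline sketch using torchvision transforms; the specific parameters are illustrative rather than the paper's exact settings.

```python
from torchvision import transforms

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(pil_image):
    # Two independent augmentations of the same image form a positive pair.
    return simclr_augment(pil_image), simclr_augment(pil_image)
```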
Practical Guidance
- Choose the method based on your goal. Anomaly detection: autoencoder with reconstruction error. Generation: VAE. Pretraining for classification: contrastive learning. Interpretable features: sparse autoencoder. Visualization: VAE with 2D latent space.
- Evaluate representations through downstream tasks, not just reconstruction loss. Linear probing (freeze the encoder, train a linear classifier on its features) is the gold standard for evaluating self-supervised representations. Low reconstruction loss alone does not guarantee useful features (see the probing sketch after this list).
- Self-supervised learning has become the dominant pretraining paradigm. From autoencoders to contrastive learning to masked modeling, the field has converged on a pattern: learn general representations from unlabeled data, then adapt to specific tasks with minimal labels. This pattern is what underlies today's foundation models.
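A linear probing sketch, assuming PyTorch; `encoder`, `feature_dim`, `num_classes`, and `train_loader` are placeholders for a pretrained encoder (used without its projection head) and a labeled dataset.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feature_dim, num_classes, train_loader, epochs=10):
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)                 # freeze: only the probe is trained

    probe = nn.Linear(feature_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                feats = encoder(x)              # pre-projection-head representation
            loss = nn.functional.cross_entropy(probe(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```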