Chapter 19 Key Takeaways

The Big Picture

The Transformer architecture replaced recurrence with attention as the primary mechanism for sequence modeling. By processing all positions in parallel through self-attention, the Transformer is faster to train, easier to scale, and better at capturing long-range dependencies than RNN-based models. The architecture from "Attention Is All You Need" (2017) remains the foundation of virtually every modern language model.


Architecture Components

Positional Encoding

  • Self-attention is permutation-equivariant --- it has no inherent notion of order.
  • Sinusoidal positional encodings use sine/cosine functions at different frequencies to create unique position fingerprints (sketched after this list).
  • Learned positional embeddings are more flexible but limited to the maximum training length.
  • Positional encodings are added (not concatenated) to token embeddings.
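
A minimal sketch of the sinusoidal table, assuming PyTorch and an even $d_{\text{model}}$ (the function name is illustrative):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """(max_len, d_model) table of fixed sinusoidal encodings (assumes even d_model)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))                 # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
    return pe

# Usage: added (not concatenated) to the scaled token embeddings, e.g.
# x = token_embeddings * math.sqrt(d_model) + sinusoidal_positional_encoding(seq_len, d_model)
```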

Layer Normalization

  • Normalizes across the feature dimension: $\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ (illustrated in the code below).
  • Pre-norm (before sublayer) provides more stable training than post-norm (after residual).
  • Independent of batch size, unlike batch normalization.
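
A quick numerical check of the formula against PyTorch's built-in `nn.LayerNorm` (a sketch with illustrative tensor shapes and the default $\gamma = 1$, $\beta = 0$):

```python
import torch
import torch.nn as nn

d_model, eps = 512, 1e-5
x = torch.randn(2, 10, d_model)                        # (batch, seq_len, d_model)

# Manual computation over the feature dimension, matching the formula above
# (gamma = 1, beta = 0, i.e. the default initialization).
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + eps)

# PyTorch's built-in module gives the same result, independent of batch size.
layer_norm = nn.LayerNorm(d_model, eps=eps)
print(torch.allclose(manual, layer_norm(x), atol=1e-5))   # True
```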

Feed-Forward Network

  • Two-layer MLP applied independently at each position: $\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2$ (the original paper used ReLU); see the sketch below.
  • Inner dimension is typically $4 \times d_{\text{model}}$, creating an expand-then-compress bottleneck.
  • The primary source of nonlinearity and the majority of parameters per layer.
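
A minimal position-wise FFN in PyTorch, following the GELU form above with the typical $4\times$ expansion (class name and dropout rate are illustrative):

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two-layer MLP applied independently at every position (d_model -> d_ff -> d_model)."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: W_1 x + b_1
            nn.GELU(),                  # nonlinearity (ReLU in the original paper)
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back: W_2 (.) + b_2
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        return self.net(x)
```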

Residual Connections

  • $\mathbf{y} = \mathbf{x} + \text{Sublayer}(\text{LayerNorm}(\mathbf{x}))$ (pre-norm); see the wrapper sketch below.
  • Create a "gradient highway" that prevents vanishing gradients in deep networks.
  • The residual stream perspective: each sublayer reads from and writes additive updates to a shared information stream.
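
A sketch of the pre-norm residual pattern as a PyTorch wrapper (class name is illustrative; the dropout placement follows common practice):

```python
import torch.nn as nn

class PreNormResidual(nn.Module):
    """y = x + Sublayer(LayerNorm(x)): a pre-norm residual wrapper around any sublayer."""
    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, *args, **kwargs):
        # The sublayer reads the normalized stream and writes an additive update back.
        return x + self.dropout(self.sublayer(self.norm(x), *args, **kwargs))
```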

Encoder vs. Decoder

| Property | Encoder | Decoder |
|---|---|---|
| Self-attention | Unmasked (bidirectional) | Causally masked |
| Cross-attention | None | Attends to encoder output |
| Sublayers per block | 2 (self-attn + FFN) | 3 (self-attn + cross-attn + FFN) |
| Purpose | Build rich input representation | Generate output autoregressively |
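
The masking difference is the crux of the table: a small PyTorch sketch of the decoder's causal mask, applied to the raw scores before the softmax (sizes are illustrative; the encoder simply omits this mask):

```python
import torch

seq_len = 5
# Decoder self-attention: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()   # lower-triangular

# Typical use: overwrite disallowed scores with -inf before the softmax.
scores = torch.randn(seq_len, seq_len)                          # raw attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)                            # future positions get weight 0

# The encoder applies no such mask, so every position attends to every other position.
```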

Training

  • Teacher forcing: Provide ground-truth previous tokens to the decoder during training.
  • Causal mask: Ensures the decoder cannot see future tokens, even when all target tokens are provided simultaneously.
  • Label smoothing ($\epsilon = 0.1$): Distributes probability mass to non-target tokens, improving generalization.
  • Warm-up schedule: Linear learning rate increase for the first 4,000 steps, then inverse square root decay (see the code after this list).
  • Gradient clipping: Clip gradient norms to 1.0 to prevent training instabilities.
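
The warm-up bullet corresponds to the original paper's learning-rate formula; a small sketch, with an illustrative function name:

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from the original schedule: linear warm-up, then inverse-sqrt decay."""
    step = max(step, 1)                    # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Peaks at step == warmup_steps: transformer_lr(4000) is roughly 7e-4 for d_model = 512.
```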

Architecture Variants

| Variant | Examples | Best For |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Understanding tasks (classification, NER, QA) |
| Decoder-only | GPT, LLaMA | Generation tasks (text, code) |
| Encoder-decoder | T5, BART | Seq-to-seq tasks (translation, summarization) |

Key Numbers

| Component | Original Paper | Formula |
|---|---|---|
| $d_{\text{model}}$ | 512 | |
| $d_{\text{ff}}$ | 2048 | $4 \times d_{\text{model}}$ |
| Heads $h$ | 8 | |
| $d_k = d_v$ | 64 | $d_{\text{model}} / h$ |
| Layers $N$ | 6 (encoder) + 6 (decoder) | |
| Base model params | ~65M | |
| Big model params | ~213M | |
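
These numbers can be sanity-checked with a back-of-envelope parameter count that ignores biases and LayerNorm parameters and assumes the paper's roughly 37k shared BPE vocabulary (a rough sketch, not an exact count):

```python
# Rough parameter count for the base configuration (biases and LayerNorms ignored).
d_model, d_ff, n_layers, vocab = 512, 2048, 6, 37_000   # ~37k shared BPE vocabulary

attn  = 4 * d_model * d_model         # Q, K, V, and output projections
ffn   = 2 * d_model * d_ff            # expand + contract
enc   = n_layers * (attn + ffn)       # 2 sublayers per encoder block
dec   = n_layers * (2 * attn + ffn)   # self-attn + cross-attn + FFN per decoder block
embed = vocab * d_model               # one shared embedding / output matrix (weight tying)

print(f"{(enc + dec + embed) / 1e6:.0f}M")   # ~63M, consistent with the quoted ~65M
```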

Practical Insights

  1. Embedding scaling: Multiply embeddings by $\sqrt{d_{\text{model}}}$ so they are on the same scale as positional encodings (items 1-3 are sketched after this list).
  2. Weight tying: Share the decoder embedding and output projection weights to reduce parameters and improve performance.
  3. Xavier initialization: Initialize weight matrices with Xavier uniform for stable training.
  4. Mixed precision: Use FP16/BF16 for ~2x speedup and ~50% memory reduction.
  5. FFN dominates compute for short sequences: Attention is $O(n^2 d)$ while FFN is $O(n d^2)$; attention only dominates when $n > d$.
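
A PyTorch sketch combining items 1-3: a tied, Xavier-initialized embedding/output matrix plus $\sqrt{d_{\text{model}}}$ scaling (all sizes and names are illustrative):

```python
import math
import torch.nn as nn

d_model, vocab_size = 512, 32_000            # illustrative sizes

embedding = nn.Embedding(vocab_size, d_model)
output_proj = nn.Linear(d_model, vocab_size, bias=False)
output_proj.weight = embedding.weight        # weight tying: one shared (vocab, d_model) matrix

# Xavier-uniform initialization of the shared matrix.
nn.init.xavier_uniform_(embedding.weight)

def embed(tokens):
    # Scale by sqrt(d_model) so embeddings are on the same scale as positional encodings.
    return embedding(tokens) * math.sqrt(d_model)
```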

Common Pitfalls

  1. Missing causal mask in the decoder: The model will "cheat" by looking at future tokens, achieving low training loss but generating nonsense.
  2. Applying causal mask to the encoder: The encoder should be bidirectional.
  3. Forgetting to scale embeddings: Without $\sqrt{d_{\text{model}}}$ scaling, positional signals are drowned out.
  4. Too-high learning rate without warmup: Causes training to diverge, especially for larger models.
  5. Ignoring padding in attention: Padding tokens receive attention weight, contaminating representations (see the masking sketch below).
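
A sketch of the fix for pitfall 5: build a padding mask from the token IDs and push attention to pad positions to zero before the softmax (PyTorch; `pad_id` and sizes are illustrative):

```python
import torch

pad_id = 0                                                     # illustrative pad token id
tokens = torch.tensor([[5, 9, 2, pad_id, pad_id]])             # (batch=1, seq_len=5)

# True where attention is allowed: real tokens only; broadcasts over query positions.
key_mask = (tokens != pad_id).unsqueeze(1)                     # (batch, 1, seq_len)

scores = torch.randn(1, 5, 5)                                  # (batch, queries, keys)
scores = scores.masked_fill(~key_mask, float("-inf"))          # pad keys get zero attention weight
attn = torch.softmax(scores, dim=-1)
```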

Looking Ahead

  • Chapter 20: Pre-training and transfer learning with BERT, T5, and the HuggingFace ecosystem.
  • Chapter 21: Decoder-only models (GPT family) and autoregressive text generation.
  • Chapter 22: Scaling laws and the path from millions to billions of parameters.