Chapter 19 Key Takeaways

The Big Picture

The Transformer architecture replaced recurrence with attention as the primary mechanism for sequence modeling. By processing all positions in parallel through self-attention, the Transformer is faster to train, easier to scale, and better at capturing long-range dependencies than RNN-based models. The architecture from "Attention Is All You Need" (2017) remains the foundation of virtually every modern language model.


Architecture Components

Positional Encoding

  • Self-attention is permutation-equivariant --- it has no inherent notion of order.
  • Sinusoidal positional encodings use sine/cosine functions at different frequencies to create unique position fingerprints (sketched after this list).
  • Learned positional embeddings are more flexible but limited to the maximum training length.
  • Positional encodings are added (not concatenated) to token embeddings.
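
A minimal sketch of the sinusoidal table, assuming PyTorch and an even $d_{\text{model}}$ (the function name is illustrative):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """(max_len, d_model) table of fixed sinusoidal encodings (assumes even d_model)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))                 # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
    return pe

# Usage: added (not concatenated) to the scaled token embeddings, e.g.
# x = token_embeddings * math.sqrt(d_model) + sinusoidal_positional_encoding(seq_len, d_model)
```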

Layer Normalization

  • Normalizes across the feature dimension: $\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ (illustrated in the code below).
  • Pre-norm (before sublayer) provides more stable training than post-norm (after residual).
  • Independent of batch size, unlike batch normalization.
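
A quick numerical check of the formula against PyTorch's built-in `nn.LayerNorm` (a sketch with illustrative tensor shapes and the default $\gamma = 1$, $\beta = 0$):

```python
import torch
import torch.nn as nn

d_model, eps = 512, 1e-5
x = torch.randn(2, 10, d_model)                        # (batch, seq_len, d_model)

# Manual computation over the feature dimension, matching the formula above
# (gamma = 1, beta = 0, i.e. the default initialization).
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + eps)

# PyTorch's built-in module gives the same result, independent of batch size.
layer_norm = nn.LayerNorm(d_model, eps=eps)
print(torch.allclose(manual, layer_norm(x), atol=1e-5))   # True
```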

Feed-Forward Network

  • Two-layer MLP applied independently at each position: $\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2$ (the original paper used ReLU); see the sketch below.
  • Inner dimension is typically $4 \times d_{\text{model}}$, creating an expand-then-compress bottleneck.
  • The primary source of nonlinearity and the majority of parameters per layer.
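
A minimal position-wise FFN in PyTorch, following the GELU form above with the typical $4\times$ expansion (class name and dropout rate are illustrative):

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two-layer MLP applied independently at every position (d_model -> d_ff -> d_model)."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: W_1 x + b_1
            nn.GELU(),                  # nonlinearity (ReLU in the original paper)
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back: W_2 (.) + b_2
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        return self.net(x)
```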

Residual Connections

  • $\mathbf{y} = \mathbf{x} + \text{Sublayer}(\text{LayerNorm}(\mathbf{x}))$ (pre-norm); see the wrapper sketch below.
  • Create a "gradient highway" that prevents vanishing gradients in deep networks.
  • The residual stream perspective: each sublayer reads from and writes additive updates to a shared information stream.
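
A sketch of the pre-norm residual pattern as a PyTorch wrapper (class name is illustrative; the dropout placement follows common practice):

```python
import torch.nn as nn

class PreNormResidual(nn.Module):
    """y = x + Sublayer(LayerNorm(x)): a pre-norm residual wrapper around any sublayer."""
    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, *args, **kwargs):
        # The sublayer reads the normalized stream and writes an additive update back.
        return x + self.dropout(self.sublayer(self.norm(x), *args, **kwargs))
```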

Encoder vs. Decoder

| Property | Encoder | Decoder |
|---|---|---|
| Self-attention | Unmasked (bidirectional) | Causally masked |
| Cross-attention | None | Attends to encoder output |
| Sublayers per block | 2 (self-attn + FFN) | 3 (self-attn + cross-attn + FFN) |
| Purpose | Build rich input representation | Generate output autoregressively |
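
The masking difference is the crux of the table: a small PyTorch sketch of the decoder's causal mask, applied to the raw scores before the softmax (sizes are illustrative; the encoder simply omits this mask):

```python
import torch

seq_len = 5
# Decoder self-attention: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()   # lower-triangular

# Typical use: overwrite disallowed scores with -inf before the softmax.
scores = torch.randn(seq_len, seq_len)                          # raw attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)                            # future positions get weight 0

# The encoder applies no such mask, so every position attends to every other position.
```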

Training

  • Teacher forcing: Provide ground-truth previous tokens to the decoder during training.
  • Causal mask: Ensures the decoder cannot see future tokens, even when all target tokens are provided simultaneously.
  • Label smoothing ($\epsilon = 0.1$): Distributes probability mass to non-target tokens, improving generalization.
  • Warm-up schedule: Linear learning rate increase for the first 4,000 steps, then inverse square root decay (see the code after this list).
  • Gradient clipping: Clip gradient norms to 1.0 to prevent training instabilities.
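
The warm-up bullet corresponds to the original paper's learning-rate formula; a small sketch, with an illustrative function name:

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from the original schedule: linear warm-up, then inverse-sqrt decay."""
    step = max(step, 1)                    # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Peaks at step == warmup_steps: transformer_lr(4000) is roughly 7e-4 for d_model = 512.
```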

Architecture Variants

| Variant | Examples | Best For |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Understanding tasks (classification, NER, QA) |
| Decoder-only | GPT, LLaMA | Generation tasks (text, code) |
| Encoder-decoder | T5, BART | Seq-to-seq tasks (translation, summarization) |

Key Numbers

| Component | Original Paper | Formula |
|---|---|---|
| $d_{\text{model}}$ | 512 | |
| $d_{\text{ff}}$ | 2048 | $4 \times d_{\text{model}}$ |
| Heads $h$ | 8 | |
| $d_k = d_v$ | 64 | $d_{\text{model}} / h$ |
| Layers $N$ | 6 (encoder) + 6 (decoder) | |
| Base model params | ~65M | |
| Big model params | ~213M | |
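
These numbers can be sanity-checked with a back-of-envelope parameter count that ignores biases and LayerNorm parameters and assumes the paper's roughly 37k shared BPE vocabulary (a rough sketch, not an exact count):

```python
# Rough parameter count for the base configuration (biases and LayerNorms ignored).
d_model, d_ff, n_layers, vocab = 512, 2048, 6, 37_000   # ~37k shared BPE vocabulary

attn  = 4 * d_model * d_model         # Q, K, V, and output projections
ffn   = 2 * d_model * d_ff            # expand + contract
enc   = n_layers * (attn + ffn)       # 2 sublayers per encoder block
dec   = n_layers * (2 * attn + ffn)   # self-attn + cross-attn + FFN per decoder block
embed = vocab * d_model               # one shared embedding / output matrix (weight tying)

print(f"{(enc + dec + embed) / 1e6:.0f}M")   # ~63M, consistent with the quoted ~65M
```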

Practical Insights

  1. Embedding scaling: Multiply embeddings by $\sqrt{d_{\text{model}}}$ so they are on the same scale as positional encodings (items 1-3 are sketched after this list).
  2. Weight tying: Share the decoder embedding and output projection weights to reduce parameters and improve performance.
  3. Xavier initialization: Initialize weight matrices with Xavier uniform for stable training.
  4. Mixed precision: Use FP16/BF16 for ~2x speedup and ~50% memory reduction.
  5. FFN dominates compute for short sequences: Attention is $O(n^2 d)$ while FFN is $O(n d^2)$; attention only dominates when $n > d$.
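
A PyTorch sketch combining items 1-3: a tied, Xavier-initialized embedding/output matrix plus $\sqrt{d_{\text{model}}}$ scaling (all sizes and names are illustrative):

```python
import math
import torch.nn as nn

d_model, vocab_size = 512, 32_000            # illustrative sizes

embedding = nn.Embedding(vocab_size, d_model)
output_proj = nn.Linear(d_model, vocab_size, bias=False)
output_proj.weight = embedding.weight        # weight tying: one shared (vocab, d_model) matrix

# Xavier-uniform initialization of the shared matrix.
nn.init.xavier_uniform_(embedding.weight)

def embed(tokens):
    # Scale by sqrt(d_model) so embeddings are on the same scale as positional encodings.
    return embedding(tokens) * math.sqrt(d_model)
```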

Common Pitfalls

  1. Missing causal mask in the decoder: The model will "cheat" by looking at future tokens, achieving low training loss but generating nonsense.
  2. Applying causal mask to the encoder: The encoder should be bidirectional.
  3. Forgetting to scale embeddings: Without $\sqrt{d_{\text{model}}}$ scaling, positional signals are drowned out.
  4. Too-high learning rate without warmup: Causes training to diverge, especially for larger models.
  5. Ignoring padding in attention: Padding tokens receive attention weight, contaminating representations (see the masking sketch below).
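
A sketch of the fix for pitfall 5: build a padding mask from the token IDs and push attention to pad positions to zero before the softmax (PyTorch; `pad_id` and sizes are illustrative):

```python
import torch

pad_id = 0                                                     # illustrative pad token id
tokens = torch.tensor([[5, 9, 2, pad_id, pad_id]])             # (batch=1, seq_len=5)

# True where attention is allowed: real tokens only; broadcasts over query positions.
key_mask = (tokens != pad_id).unsqueeze(1)                     # (batch, 1, seq_len)

scores = torch.randn(1, 5, 5)                                  # (batch, queries, keys)
scores = scores.masked_fill(~key_mask, float("-inf"))          # pad keys get zero attention weight
attn = torch.softmax(scores, dim=-1)
```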

Looking Ahead

  • Chapter 20: Pre-training and transfer learning with BERT, T5, and the HuggingFace ecosystem.
  • Chapter 21: Decoder-only models (GPT family) and autoregressive text generation.
  • Chapter 22: Scaling laws and the path from millions to billions of parameters.