Chapter 21 Key Takeaways

The Big Picture

Decoder-only Transformers, trained with the autoregressive next-token prediction objective, form the foundation of modern large language models. The GPT family demonstrated that this deceptively simple framework---predict the next word given all previous words---when scaled to billions of parameters and trained on massive text corpora, produces models with strong language understanding and generation capabilities. The architecture itself has remained remarkably stable across generations; the primary lever has been scale.


Autoregressive Language Modeling

  • Autoregressive factorization: $P(\mathbf{x}) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$. This is exact (it is simply the chain rule of probability, with no approximation) and enables straightforward sequential generation.
  • Training objective: Minimize the cross-entropy loss $\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log P_\theta(x_t \mid \mathbf{x}_{<t})$, the average negative log-likelihood over all $N$ token positions.
  • Perplexity: $\text{PPL} = \exp(\mathcal{L})$. Lower is better. Interpretable as the effective number of equally likely next tokens.
  • Parallel training: Despite sequential generation, training processes all positions simultaneously using causal masking, as sketched below.
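
A minimal sketch of the training objective and perplexity above, computed in parallel over all positions with shifted targets. The random logits stand in for a model's output, and the batch size, sequence length, and vocabulary size are illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; any decoder-only LM that returns (batch, seq, vocab) logits fits here.
batch, seq_len, vocab_size = 4, 128, 50257
tokens = torch.randint(vocab_size, (batch, seq_len + 1))   # one extra token for the shift

inputs  = tokens[:, :-1]   # x_1 .. x_T       (what the model sees)
targets = tokens[:, 1:]    # x_2 .. x_{T+1}   (what it must predict at each position)

logits = torch.randn(batch, seq_len, vocab_size)   # stand-in for model(inputs)

# Cross-entropy over all positions at once: causal masking guarantees that
# position t only used x_1..x_t, so nothing leaks from the target.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

perplexity = loss.exp()    # PPL = exp(mean negative log-likelihood)
print(f"loss {loss.item():.3f}, perplexity {perplexity.item():.1f}")
```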

Causal (Left-to-Right) Masking

  • Each position can attend only to itself and previous positions; the additive mask applied to the attention scores before the softmax is $M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$
  • Prevents "information leakage" from future tokens during training.
  • Without causal masking, the model can "cheat" by copying answers from future positions, achieving low training loss but failing at generation. The sketch below shows how the mask is built and applied.
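
A minimal single-head sketch of the mask above, assuming scaled dot-product attention; the tensor shapes and variable names are illustrative rather than the chapter's exact implementation.

```python
import torch
import torch.nn.functional as F

T, d = 6, 16                        # illustrative sequence length and head dimension
q, k, v = (torch.randn(1, T, d) for _ in range(3))

scores = q @ k.transpose(-2, -1) / d ** 0.5            # (1, T, T) scaled dot products

# Lower-triangular boolean matrix: True where j <= i (allowed), False for future positions.
allowed = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Additive mask in effect: 0 where allowed, -inf where j > i,
# so the softmax assigns zero weight to future tokens.
scores = scores.masked_fill(~allowed, float("-inf"))

weights = F.softmax(scores, dim=-1)                    # each row sums to 1 over positions <= i
out = weights @ v                                      # (1, T, d)
```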

GPT Architecture Evolution

| Feature | GPT-1 (2018) | GPT-2 (2019) | GPT-3 (2020) |
|---|---|---|---|
| Parameters | 117M | 1.5B | 175B |
| Layers | 12 | 48 | 96 |
| Hidden dim | 768 | 1600 | 12288 |
| Context length | 512 | 1024 | 2048 |
| Layer norm | Post-norm | Pre-norm | Pre-norm |
| Key innovation | Pre-train + fine-tune | Zero-shot transfer | In-context learning |

Core Architecture (Shared Across All GPT Models)

  1. Token embedding + learned positional embedding
  2. Stack of $L$ decoder blocks (masked self-attention + FFN)
  3. Final layer norm
  4. Language modeling head (linear projection to vocabulary, weight-tied with token embedding)

Key Design Decisions

  • Pre-norm (GPT-2+): Layer norm before each sublayer, not after. More stable training.
  • GELU activation: Smoother than ReLU, standard in GPT models.
  • Weight tying: Input embedding and output projection share the same matrix.
  • Residual scaling: GPT-2 scales the initialization of the residual-path output projections by $1/\sqrt{N}$, where $N$ is the number of residual layers.

Mini-GPT Implementation

  • GPTConfig dataclass: Centralized hyperparameters (vocab_size, block_size, n_layer, n_head, n_embd, dropout).
  • CausalSelfAttention: Combined QKV projection, reshape for multi-head, apply causal mask via torch.tril, output projection.
  • FeedForward: Two linear layers with GELU activation and dropout. Inner dimension = $4 \times d_{\text{model}}$.
  • TransformerBlock: Pre-norm with x = x + attn(LN(x)) and x = x + ffn(LN(x)).
  • MiniGPT: Embedding layers + stacked blocks + final layer norm + language modeling head with weight tying. A condensed sketch of these pieces follows.
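
A condensed sketch of the components in this list, assuming the hyperparameter defaults above are representative; initialization details (including the $1/\sqrt{N}$ residual scaling), attention dropout, and loss computation are omitted, so treat it as an outline rather than the chapter's exact code.

```python
import math
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class GPTConfig:
    vocab_size: int = 50257
    block_size: int = 256
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384
    dropout: float = 0.1

class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.n_head = cfg.n_head
        self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd)     # combined Q, K, V projection
        self.proj = nn.Linear(cfg.n_embd, cfg.n_embd)        # output projection
        self.drop = nn.Dropout(cfg.dropout)
        mask = torch.tril(torch.ones(cfg.block_size, cfg.block_size, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape each to (B, n_head, T, head_dim)
        q, k, v = [t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v)]
        att = (q @ k.transpose(-2, -1)) / math.sqrt(C // self.n_head)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))   # causal mask
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.drop(self.proj(y))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.n_embd, 4 * cfg.n_embd),           # inner dimension = 4 x d_model
            nn.GELU(),
            nn.Linear(4 * cfg.n_embd, cfg.n_embd),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(cfg.n_embd), nn.LayerNorm(cfg.n_embd)
        self.attn, self.ffn = CausalSelfAttention(cfg), FeedForward(cfg)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))                       # pre-norm: LN before each sublayer
        x = x + self.ffn(self.ln2(x))
        return x

class MiniGPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)   # learned positions
        self.blocks = nn.ModuleList(TransformerBlock(cfg) for _ in range(cfg.n_layer))
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight               # weight tying

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                       # (B, T, vocab_size) logits
```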

Text Generation Strategies

| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Greedy | Always pick the argmax | Deterministic, fast | Repetitive, boring |
| Temperature ($\tau$) | Divide logits by $\tau$ | Controls randomness | Does not prevent unlikely tokens |
| Top-k | Sample from the $k$ most likely tokens | Prevents low-probability tokens | Fixed $k$ ignores distribution shape |
| Top-p (nucleus) | Sample from the smallest set with cumulative probability $\geq p$ | Adapts to the distribution | Slightly more complex |
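
A sketch of the three stochastic strategies from the table, applied to a single logits vector. The function name and default values are illustrative; production decoders (batched, with tie-breaking and caching) differ in details.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one next-token id from a (vocab_size,) logits vector."""
    logits = logits / max(temperature, 1e-8)           # temperature scaling

    if top_k is not None:                              # keep only the k most likely tokens
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p is not None:                              # nucleus: smallest set with cumulative prob >= p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()               # keep the token that first crosses p
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Usage roughly matching the recommendations below: nucleus sampling, moderate temperature.
next_id = sample_next(torch.randn(50257), temperature=0.8, top_p=0.9)
```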

Temperature Effects

  • $\tau \to 0$: Approaches greedy (peaked distribution)
  • $\tau = 1$: Original distribution
  • $\tau > 1$: Flatter distribution (more random)

Practical Recommendations

  • Default: nucleus sampling with $p = 0.9$--$0.95$, temperature $0.7$--$0.9$
  • Add repetition penalty ($\alpha = 1.1$--$1.3$) for open-ended generation
  • Use greedy or low temperature for deterministic tasks

KV Caching

  • During autoregressive generation, re-computing attention over all previous tokens at each step is wasteful.
  • KV cache: Store the key and value projections from all previous tokens. At each step, only compute Q, K, V for the new token.
  • Reduces the per-step projection cost from $O(T \cdot d^2)$ (re-projecting the whole prefix) to $O(d^2)$ (projecting only the new token); computing attention scores against the cached keys still costs $O(T \cdot d)$ per step.
  • Memory cost: $2 \times L \times T \times d$ per sequence (keys + values across all layers).
  • Essential for practical deployment of large language models; a minimal sketch of the caching pattern follows.
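
A minimal single-head sketch of the caching pattern, with illustrative random projection weights; a real implementation caches keys and values per layer and per head, handles batching, and reuses the model's trained projections.

```python
import torch
import torch.nn.functional as F

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))     # illustrative projection weights

def generation_step(x_new, cache):
    """One decoding step for a single head.

    x_new: (1, d) embedding of the newly generated token.
    cache: dict holding 'k' and 'v' tensors of shape (t, d) for the t previous tokens.
    """
    q = x_new @ Wq                                      # project only the new token
    cache["k"] = torch.cat([cache["k"], x_new @ Wk])    # append new key to the cache
    cache["v"] = torch.cat([cache["v"], x_new @ Wv])    # append new value to the cache

    # Attend over all cached keys/values; no causal mask is needed because the
    # cache only ever contains past tokens plus the current one.
    att = F.softmax(q @ cache["k"].T / d ** 0.5, dim=-1)
    return att @ cache["v"], cache

cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
for _ in range(5):                                      # pretend we decode 5 tokens
    out, cache = generation_step(torch.randn(1, d), cache)
```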

Key Numbers

| Component | Mini-GPT (this chapter) | GPT-2 Small |
|---|---|---|
| $d_{\text{model}}$ | 384 | 768 |
| $d_{\text{ff}}$ | 1536 | 3072 |
| Heads | 6 | 12 |
| Layers | 6 | 12 |
| Context length | 256 | 1024 |
| Parameters | ~30M | ~124M |
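
A quick arithmetic check of the GPT-2 Small column. The vocabulary size of 50,257 is the published GPT-2 value (it is not listed in the table), and biases and layer-norm parameters are ignored.

```python
# Rough parameter count for GPT-2 Small; with weight tying the LM head adds nothing extra.
d, layers, vocab, ctx = 768, 12, 50_257, 1024

embeddings = vocab * d + ctx * d           # token + learned positional embeddings
per_layer = 4 * d * d + 8 * d * d          # attention (Q, K, V, output) + FFN (4x expansion)
total = embeddings + layers * per_layer

print(f"{total / 1e6:.0f}M parameters")    # ~124M, matching the table
```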

Practical Insights

  1. Weight tying reduces parameters significantly: For a 50K vocabulary with 768-dim embeddings, tying saves ~38M parameters.
  2. Pre-norm trains much more stably than post-norm, at the cost of somewhat reduced effective depth.
  3. Character-level models learn spelling from scratch, which is sample-inefficient. BPE tokenization is far more practical.
  4. Gradient clipping (norm $\leq 1.0$) is essential for stable Transformer training.
  5. Learning rate warmup followed by cosine or inverse-sqrt decay is the standard schedule (this schedule and the gradient clipping from point 4 are sketched after this list).
  6. Larger models learn faster per token (better sample efficiency) but require more compute per step.
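
A sketch of points 4 and 5 above using PyTorch's built-in utilities; the stand-in model, loss, and the warmup and total step counts are placeholders, not the chapter's training setup.

```python
import math
import torch

model = torch.nn.Linear(10, 10)            # stand-in for the actual Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, max_steps = 1_000, 100_000   # placeholder schedule lengths

def lr_lambda(step):
    if step < warmup_steps:                # linear warmup up to the base LR
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))   # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(3):                      # a few illustrative steps
    loss = model(torch.randn(8, 10)).pow(2).mean()                    # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # point 4
    optimizer.step()
    scheduler.step()                       # point 5
```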

Common Pitfalls

  1. Forgetting the causal mask: The model achieves near-zero training loss but generates nonsense.
  2. Not using KV caching during inference: Each generated token then costs $O(T^2)$ attention work instead of $O(T)$, because the entire prefix is re-processed at every step.
  3. Greedy decoding for open-ended generation: Produces repetitive, degenerate text.
  4. Too-high temperature: Makes the model produce incoherent text.
  5. Exceeding the context window: The model has no mechanism to attend beyond its maximum context length.
  6. Embedding scaling mismatch: Some implementations multiply the token embeddings by $\sqrt{d_{\text{model}}}$ and others do not; be consistent with the convention of the pre-trained model you are loading.

Decoder-Only vs. Encoder-Only vs. Encoder-Decoder

| Property | Decoder-Only (GPT) | Encoder-Only (BERT) | Encoder-Decoder (T5) |
|---|---|---|---|
| Attention | Causal (unidirectional) | Bidirectional | Both |
| Best for | Generation, in-context learning | Classification, NLU | Seq-to-seq, translation |
| Pre-training | Next-token prediction | Masked LM | Span corruption |
| Inference | Sequential (autoregressive) | Single pass | Encode once + decode |

Looking Ahead

  • Chapter 22: Scaling laws---how performance relates to model size, data, and compute.
  • Chapter 23: Fine-tuning with human feedback (RLHF) and instruction tuning.
  • Chapter 24: Prompt engineering and in-context learning.