Chapter 21 Key Takeaways

The Big Picture

Decoder-only Transformers, trained with the autoregressive next-token prediction objective, form the foundation of modern large language models. The GPT family demonstrated that this deceptively simple framework---predict the next word given all previous words---when scaled to billions of parameters and trained on massive text corpora, produces models with strong language understanding and generation capabilities. The architecture itself has remained remarkably stable across generations; the primary lever has been scale.


Autoregressive Language Modeling

  • Autoregressive factorization: $P(\mathbf{x}) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$. This is exact (it is simply the chain rule of probability, with no approximation) and enables straightforward sequential generation.
  • Training objective: Minimize the cross-entropy loss $\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log P_\theta(x_t \mid \mathbf{x}_{<t})$, the average negative log-likelihood over all $N$ token positions.
  • Perplexity: $\text{PPL} = \exp(\mathcal{L})$. Lower is better. Interpretable as the effective number of equally likely next tokens.
  • Parallel training: Despite sequential generation, training processes all positions simultaneously using causal masking, as sketched below.
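
A minimal sketch of the training objective and perplexity above, computed in parallel over all positions with shifted targets. The random logits stand in for a model's output, and the batch size, sequence length, and vocabulary size are illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; any decoder-only LM that returns (batch, seq, vocab) logits fits here.
batch, seq_len, vocab_size = 4, 128, 50257
tokens = torch.randint(vocab_size, (batch, seq_len + 1))   # one extra token for the shift

inputs  = tokens[:, :-1]   # x_1 .. x_T       (what the model sees)
targets = tokens[:, 1:]    # x_2 .. x_{T+1}   (what it must predict at each position)

logits = torch.randn(batch, seq_len, vocab_size)   # stand-in for model(inputs)

# Cross-entropy over all positions at once: causal masking guarantees that
# position t only used x_1..x_t, so nothing leaks from the target.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

perplexity = loss.exp()    # PPL = exp(mean negative log-likelihood)
print(f"loss {loss.item():.3f}, perplexity {perplexity.item():.1f}")
```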

Causal (Left-to-Right) Masking

  • Each position can attend only to itself and previous positions; the additive mask applied to the attention scores before the softmax is $M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$
  • Prevents "information leakage" from future tokens during training.
  • Without causal masking, the model can "cheat" by copying answers from future positions, achieving low training loss but failing at generation. The sketch below shows how the mask is built and applied.
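
A minimal single-head sketch of the mask above, assuming scaled dot-product attention; the tensor shapes and variable names are illustrative rather than the chapter's exact implementation.

```python
import torch
import torch.nn.functional as F

T, d = 6, 16                        # illustrative sequence length and head dimension
q, k, v = (torch.randn(1, T, d) for _ in range(3))

scores = q @ k.transpose(-2, -1) / d ** 0.5            # (1, T, T) scaled dot products

# Lower-triangular boolean matrix: True where j <= i (allowed), False for future positions.
allowed = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Additive mask in effect: 0 where allowed, -inf where j > i,
# so the softmax assigns zero weight to future tokens.
scores = scores.masked_fill(~allowed, float("-inf"))

weights = F.softmax(scores, dim=-1)                    # each row sums to 1 over positions <= i
out = weights @ v                                      # (1, T, d)
```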

GPT Architecture Evolution

| Feature | GPT-1 (2018) | GPT-2 (2019) | GPT-3 (2020) |
|---|---|---|---|
| Parameters | 117M | 1.5B | 175B |
| Layers | 12 | 48 | 96 |
| Hidden dim | 768 | 1600 | 12288 |
| Context length | 512 | 1024 | 2048 |
| Layer norm | Post-norm | Pre-norm | Pre-norm |
| Key innovation | Pre-train + fine-tune | Zero-shot transfer | In-context learning |

Core Architecture (Shared Across All GPT Models)

  1. Token embedding + learned positional embedding
  2. Stack of $L$ decoder blocks (masked self-attention + FFN)
  3. Final layer norm
  4. Language modeling head (linear projection to vocabulary, weight-tied with token embedding)

Key Design Decisions

  • Pre-norm (GPT-2+): Layer norm before each sublayer, not after. More stable training.
  • GELU activation: Smoother than ReLU, standard in GPT models.
  • Weight tying: Input embedding and output projection share the same matrix.
  • Residual scaling: GPT-2 scales the initialization of the residual-path output projections by $1/\sqrt{N}$, where $N$ is the number of residual layers.

Mini-GPT Implementation

  • GPTConfig dataclass: Centralized hyperparameters (vocab_size, block_size, n_layer, n_head, n_embd, dropout).
  • CausalSelfAttention: Combined QKV projection, reshape for multi-head, apply causal mask via torch.tril, output projection.
  • FeedForward: Two linear layers with GELU activation and dropout. Inner dimension = $4 \times d_{\text{model}}$.
  • TransformerBlock: Pre-norm with x = x + attn(LN(x)) and x = x + ffn(LN(x)).
  • MiniGPT: Embedding layers + stacked blocks + final layer norm + language modeling head with weight tying. A condensed sketch of these pieces follows.
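
A condensed sketch of the components in this list, assuming the hyperparameter defaults above are representative; initialization details (including the $1/\sqrt{N}$ residual scaling), attention dropout, and loss computation are omitted, so treat it as an outline rather than the chapter's exact code.

```python
import math
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class GPTConfig:
    vocab_size: int = 50257
    block_size: int = 256
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384
    dropout: float = 0.1

class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.n_head = cfg.n_head
        self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd)     # combined Q, K, V projection
        self.proj = nn.Linear(cfg.n_embd, cfg.n_embd)        # output projection
        self.drop = nn.Dropout(cfg.dropout)
        mask = torch.tril(torch.ones(cfg.block_size, cfg.block_size, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape each to (B, n_head, T, head_dim)
        q, k, v = [t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v)]
        att = (q @ k.transpose(-2, -1)) / math.sqrt(C // self.n_head)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))   # causal mask
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.drop(self.proj(y))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.n_embd, 4 * cfg.n_embd),           # inner dimension = 4 x d_model
            nn.GELU(),
            nn.Linear(4 * cfg.n_embd, cfg.n_embd),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(cfg.n_embd), nn.LayerNorm(cfg.n_embd)
        self.attn, self.ffn = CausalSelfAttention(cfg), FeedForward(cfg)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))                       # pre-norm: LN before each sublayer
        x = x + self.ffn(self.ln2(x))
        return x

class MiniGPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)   # learned positions
        self.blocks = nn.ModuleList(TransformerBlock(cfg) for _ in range(cfg.n_layer))
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight               # weight tying

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                       # (B, T, vocab_size) logits
```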

Text Generation Strategies

| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Greedy | Always pick the argmax | Deterministic, fast | Repetitive, boring |
| Temperature ($\tau$) | Divide logits by $\tau$ | Controls randomness | Does not prevent unlikely tokens |
| Top-k | Sample from the $k$ most likely tokens | Prevents low-probability tokens | Fixed $k$ ignores distribution shape |
| Top-p (nucleus) | Sample from the smallest set with cumulative probability $\geq p$ | Adapts to the distribution | Slightly more complex |
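
A sketch of the three stochastic strategies from the table, applied to a single logits vector. The function name and default values are illustrative; production decoders (batched, with tie-breaking and caching) differ in details.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one next-token id from a (vocab_size,) logits vector."""
    logits = logits / max(temperature, 1e-8)           # temperature scaling

    if top_k is not None:                              # keep only the k most likely tokens
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p is not None:                              # nucleus: smallest set with cumulative prob >= p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()               # keep the token that first crosses p
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Usage roughly matching the recommendations below: nucleus sampling, moderate temperature.
next_id = sample_next(torch.randn(50257), temperature=0.8, top_p=0.9)
```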

Temperature Effects

  • $\tau \to 0$: Approaches greedy (peaked distribution)
  • $\tau = 1$: Original distribution
  • $\tau > 1$: Flatter distribution (more random)

Practical Recommendations

  • Default: nucleus sampling with $p = 0.9$--$0.95$, temperature $0.7$--$0.9$
  • Add repetition penalty ($\alpha = 1.1$--$1.3$) for open-ended generation
  • Use greedy or low temperature for deterministic tasks

KV Caching

  • During autoregressive generation, re-computing attention over all previous tokens at each step is wasteful.
  • KV cache: Store the key and value projections from all previous tokens. At each step, only compute Q, K, V for the new token.
  • Reduces the per-step projection cost from $O(T \cdot d^2)$ (re-projecting the whole prefix) to $O(d^2)$ (projecting only the new token); computing attention scores against the cached keys still costs $O(T \cdot d)$ per step.
  • Memory cost: $2 \times L \times T \times d$ per sequence (keys + values across all layers).
  • Essential for practical deployment of large language models; a minimal sketch of the caching pattern follows.
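
A minimal single-head sketch of the caching pattern, with illustrative random projection weights; a real implementation caches keys and values per layer and per head, handles batching, and reuses the model's trained projections.

```python
import torch
import torch.nn.functional as F

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))     # illustrative projection weights

def generation_step(x_new, cache):
    """One decoding step for a single head.

    x_new: (1, d) embedding of the newly generated token.
    cache: dict holding 'k' and 'v' tensors of shape (t, d) for the t previous tokens.
    """
    q = x_new @ Wq                                      # project only the new token
    cache["k"] = torch.cat([cache["k"], x_new @ Wk])    # append new key to the cache
    cache["v"] = torch.cat([cache["v"], x_new @ Wv])    # append new value to the cache

    # Attend over all cached keys/values; no causal mask is needed because the
    # cache only ever contains past tokens plus the current one.
    att = F.softmax(q @ cache["k"].T / d ** 0.5, dim=-1)
    return att @ cache["v"], cache

cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
for _ in range(5):                                      # pretend we decode 5 tokens
    out, cache = generation_step(torch.randn(1, d), cache)
```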

Key Numbers

| Component | Mini-GPT (this chapter) | GPT-2 Small |
|---|---|---|
| $d_{\text{model}}$ | 384 | 768 |
| $d_{\text{ff}}$ | 1536 | 3072 |
| Heads | 6 | 12 |
| Layers | 6 | 12 |
| Context length | 256 | 1024 |
| Parameters | ~30M | ~124M |
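
A quick arithmetic check of the GPT-2 Small column. The vocabulary size of 50,257 is the published GPT-2 value (it is not listed in the table), and biases and layer-norm parameters are ignored.

```python
# Rough parameter count for GPT-2 Small; with weight tying the LM head adds nothing extra.
d, layers, vocab, ctx = 768, 12, 50_257, 1024

embeddings = vocab * d + ctx * d           # token + learned positional embeddings
per_layer = 4 * d * d + 8 * d * d          # attention (Q, K, V, output) + FFN (4x expansion)
total = embeddings + layers * per_layer

print(f"{total / 1e6:.0f}M parameters")    # ~124M, matching the table
```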

Practical Insights

  1. Weight tying reduces parameters significantly: For a 50K vocabulary with 768-dim embeddings, tying saves ~38M parameters.
  2. Pre-norm trains much more stably than post-norm, at the cost of somewhat reduced effective depth.
  3. Character-level models learn spelling from scratch, which is sample-inefficient. BPE tokenization is far more practical.
  4. Gradient clipping (norm $\leq 1.0$) is essential for stable Transformer training.
  5. Learning rate warmup followed by cosine or inverse-sqrt decay is the standard schedule (this schedule and the gradient clipping from point 4 are sketched after this list).
  6. Larger models learn faster per token (better sample efficiency) but require more compute per step.
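
A sketch of points 4 and 5 above using PyTorch's built-in utilities; the stand-in model, loss, and the warmup and total step counts are placeholders, not the chapter's training setup.

```python
import math
import torch

model = torch.nn.Linear(10, 10)            # stand-in for the actual Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, max_steps = 1_000, 100_000   # placeholder schedule lengths

def lr_lambda(step):
    if step < warmup_steps:                # linear warmup up to the base LR
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))   # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(3):                      # a few illustrative steps
    loss = model(torch.randn(8, 10)).pow(2).mean()                    # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # point 4
    optimizer.step()
    scheduler.step()                       # point 5
```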

Common Pitfalls

  1. Forgetting the causal mask: The model achieves near-zero training loss but generates nonsense.
  2. Not using KV caching during inference: Each generated token then costs $O(T^2)$ attention work instead of $O(T)$, because the entire prefix is re-processed at every step.
  3. Greedy decoding for open-ended generation: Produces repetitive, degenerate text.
  4. Too-high temperature: Makes the model produce incoherent text.
  5. Exceeding the context window: The model has no mechanism to attend beyond its maximum context length.
  6. Embedding scaling mismatch: Some implementations multiply the token embeddings by $\sqrt{d_{\text{model}}}$ and others do not; be consistent with the convention of the pre-trained model you are loading.

Decoder-Only vs. Encoder-Only vs. Encoder-Decoder

| Property | Decoder-Only (GPT) | Encoder-Only (BERT) | Encoder-Decoder (T5) |
|---|---|---|---|
| Attention | Causal (unidirectional) | Bidirectional | Both |
| Best for | Generation, in-context learning | Classification, NLU | Seq-to-seq, translation |
| Pre-training | Next-token prediction | Masked LM | Span corruption |
| Inference | Sequential (autoregressive) | Single pass | Encode once + decode |

Looking Ahead

  • Chapter 22: Scaling laws---how performance relates to model size, data, and compute.
  • Chapter 23: Fine-tuning with human feedback (RLHF) and instruction tuning.
  • Chapter 24: Prompt engineering and in-context learning.