In This Chapter
- 21.1 Introduction: The Rise of Autoregressive Language Models
- 21.2 The Autoregressive Factorization of Language
- 21.3 Causal (Left-to-Right) Masking
- 21.4 The GPT Architecture Family
- 21.5 Decoder-Only Architecture in Detail
- 21.6 Implementing a Mini-GPT in PyTorch
- 21.7 Text Generation Strategies
- 21.8 KV Caching for Efficient Generation
- 21.9 Using GPT-2 with HuggingFace Transformers
- 21.10 Training a Mini-GPT on Text Data
- 21.11 Repetition Penalties and Advanced Decoding
- 21.12 Understanding Attention Patterns in Decoder Models
- 21.13 Positional Encoding Variants
- 21.14 Decoder-Only Models Beyond GPT
- 21.15 Speculative Decoding
- 21.16 Common Pitfalls and Debugging Tips
- 21.17 Practical Generation with HuggingFace: Extended Examples
- 21.18 From Language Modeling to Instruction Following
- 21.19 Chapter Summary
Chapter 21: Decoder-Only Models and Autoregressive Language Models
21.1 Introduction: The Rise of Autoregressive Language Models
In 2018, a relatively simple idea changed the trajectory of artificial intelligence: take the decoder half of the Transformer architecture, train it to predict the next token on a massive corpus of text, and scale it up. That idea, crystallized in OpenAI's GPT (Generative Pre-trained Transformer), launched a revolution that would eventually produce ChatGPT, Claude, and the broader large language model (LLM) ecosystem we know today.
This chapter is dedicated to understanding decoder-only Transformer models and the autoregressive generation paradigm that powers them. While Chapter 19 introduced the full encoder-decoder Transformer and Chapter 20 explored encoder-only models like BERT, here we focus exclusively on the left-to-right, next-token-prediction framework. We will build intuition for why this seemingly simple objective produces such capable models, walk through the architectural details that distinguish decoder-only models from their encoder counterparts, and implement a working mini-GPT from scratch in PyTorch.
By the end of this chapter, you will be able to:
- Explain the autoregressive factorization of language and why it is a natural fit for text generation.
- Describe the causal (left-to-right) masking mechanism and how it differs from bidirectional attention.
- Trace the architectural evolution from GPT-1 through GPT-3 and understand the key design decisions at each stage.
- Implement a complete mini-GPT model in PyTorch, including causal self-attention, positional embeddings, and the language modeling head.
- Apply multiple text generation strategies---greedy decoding, top-k sampling, top-p (nucleus) sampling, and temperature scaling---and understand their trade-offs.
- Explain KV caching and why it is essential for efficient autoregressive generation.
- Use HuggingFace's transformers library to load GPT-2 and generate text with fine-grained control over decoding parameters.
Let us begin with the mathematical foundation that underpins every autoregressive language model.
21.2 The Autoregressive Factorization of Language
21.2.1 From Joint to Conditional Probabilities
A language model assigns a probability to a sequence of tokens $\mathbf{x} = (x_1, x_2, \ldots, x_T)$. The most general way to write this is as a joint probability:
$$P(\mathbf{x}) = P(x_1, x_2, \ldots, x_T)$$
By the chain rule of probability, we can decompose this joint distribution into a product of conditional distributions:
$$P(\mathbf{x}) = \prod_{t=1}^{T} P(x_t \mid x_1, x_2, \ldots, x_{t-1}) = \prod_{t=1}^{T} P(x_t \mid \mathbf{x}_{<t})$$
This decomposition is exact---no approximations are involved. It simply states that the probability of a sequence equals the probability of the first token, times the probability of the second token given the first, times the probability of the third token given the first two, and so on. This left-to-right factorization is called the autoregressive factorization, and a model that parameterizes each conditional $P(x_t \mid \mathbf{x}_{<t})$ is called an autoregressive language model.
The term "autoregressive" comes from time series analysis, where it refers to a model that predicts the current value from previous values. In the language modeling context, autoregressive means that each token is predicted based only on the tokens that came before it---never on future tokens. This left-to-right ordering has several appealing properties:
- Natural generation: To generate text, we simply sample one token at a time from left to right, conditioning on all previously generated tokens. There is no need for iterative refinement or complex decoding schemes.
- Exact likelihood computation: Given a sequence, we can compute its exact log-likelihood as the sum of log-probabilities of each token given its prefix.
- No independence assumptions: Unlike bag-of-words models or n-gram models with fixed context windows, an autoregressive Transformer can (in principle) condition on the entire preceding context.
- Universal density estimation: The chain rule decomposition is exact, so an autoregressive model with sufficient capacity can represent any distribution over sequences.
Training an autoregressive language model reduces to a supervised learning problem. Given a training corpus tokenized into a sequence $(x_1, x_2, \ldots, x_N)$, the training objective is to maximize the log-likelihood:
$$\mathcal{L}(\theta) = \sum_{t=1}^{N} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})$$
Equivalently, we minimize the cross-entropy loss between the model's predicted distribution and the one-hot target distribution at each position:
$$\mathcal{L}_{\text{CE}}(\theta) = -\frac{1}{N} \sum_{t=1}^{N} \log P_\theta(x_t \mid \mathbf{x}_{<t})$$
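To make the objective concrete, here is a minimal sketch using toy tensors only; the random logits are a hypothetical stand-in for a model's output, and the point is simply how inputs and targets are shifted by one position and scored with cross-entropy.

```python
import torch
import torch.nn.functional as F

# Toy setup: a batch of 2 chunks of length T = 8 over a 100-token vocabulary.
vocab_size, B, T = 100, 2, 8
tokens = torch.randint(0, vocab_size, (B, T + 1))  # one extra token supplies the final target

inputs = tokens[:, :-1]    # (B, T): positions 1..T
targets = tokens[:, 1:]    # (B, T): the same sequence shifted left by one

# Stand-in for model(inputs): per-position logits over the vocabulary.
logits = torch.randn(B, T, vocab_size)

# Mean negative log-likelihood of each next token given its prefix.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(f"cross-entropy: {loss.item():.2f} (roughly ln(100) = 4.6 for random logits)")
```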
In practice, the training corpus is divided into fixed-length chunks (the context window or block size), and the model processes each chunk in parallel. Within each chunk of length $T$, the model predicts $T$ tokens simultaneously---position 1 predicts token 2, position 2 predicts token 3, and so on. This is made possible by causal masking, which we discuss next.
Recall from Chapter 19 that the standard self-attention mechanism allows each position to attend to all other positions in the sequence. For an encoder model like BERT (Chapter 20), this is desirable---the representation of each token can incorporate information from both past and future context. But for an autoregressive model, bidirectional attention would be cheating: if position $t$ can attend to position $t+1$, the model could simply copy the answer rather than learn to predict it. To enforce the autoregressive property, we apply a causal mask (also called a look-ahead mask) to the attention scores.
Before the softmax operation, we set all attention scores $e_{ij}$ with $j > i$ to $-\infty$:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V$$
where the mask $M$ is an upper-triangular matrix:
$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$
After the softmax, the $-\infty$ entries become zero, so each position $i$ attends only to positions $j \leq i$ (itself and all preceding positions). This ensures that the representation at position $t$ depends only on tokens $x_1, \ldots, x_t$, preserving the autoregressive property.
For a sequence of length 5, the causal attention mask looks like:
$$M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty & -\infty \\ 0 & 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$
Token 1 can attend only to itself. Token 2 can attend to tokens 1 and 2. Token 3 can attend to tokens 1, 2, and 3. And so on. This creates the triangular attention pattern that is the hallmark of decoder-only models.
In PyTorch, we typically create the causal mask using torch.triu (see Section 21.3.4). The mask is applied inside the attention computation, typically by setting masked positions to a large negative value (e.g., -1e9 or float('-inf')) before the softmax.
A key insight is that during training, causal masking allows the model to compute predictions for all positions in parallel. Even though generation is sequential, training is not. The causal mask ensures that position $t$ cannot "see" tokens after position $t$, so all $T$ predictions within a chunk can be computed in a single forward pass. This is what makes Transformer-based language models so much more efficient to train than RNN-based ones, which must process tokens sequentially.
OpenAI's GPT-1 (Radford et al., 2018) was the first major demonstration that a Transformer decoder trained with a language modeling objective could produce powerful general-purpose representations. The GPT-1 architecture is essentially the Transformer decoder from Vaswani et al. (2017), but without cross-attention layers (since there is no encoder to attend to). Each layer consists of masked multi-head self-attention, a position-wise feed-forward network, residual connections, and layer normalization.
GPT-2 (Radford et al., 2019) scaled up GPT-1 and introduced the idea that a sufficiently large language model could perform multiple tasks without any fine-tuning---a concept they called zero-shot learning. The key differences from GPT-1 are summarized in the comparison table in Section 21.4.2; architecturally, GPT-2 moved to pre-norm layer normalization (Section 21.5.4), a larger vocabulary, and a longer context window. GPT-2 demonstrated remarkable zero-shot capabilities on tasks like reading comprehension, summarization, and translation, despite never being explicitly trained on these tasks. This was evidence that the next-token prediction objective, at sufficient scale, implicitly learns a wide range of language understanding capabilities.
GPT-3 (Brown et al., 2020) was a landmark in scale, with 175 billion parameters trained on a diverse corpus of ~570 GB of filtered text. Its primary contribution was demonstrating in-context learning: the ability to perform new tasks simply by providing a few examples in the prompt, without any gradient updates. GPT-3 showed that performance on downstream tasks scaled smoothly with model size, and that few-shot performance approached or exceeded fine-tuned smaller models on many benchmarks. This established the paradigm of prompting as an alternative to fine-tuning.
While OpenAI released fewer technical details about GPT-4, several key themes emerged from the technical report (OpenAI, 2023). The evolution from GPT-1 to GPT-4 illustrates a remarkable pattern: the core autoregressive decoder architecture changed surprisingly little, while scale, data quality, training infrastructure, and post-training alignment techniques drove the dramatic improvements in capability.
An important inflection point in the GPT lineage came not from OpenAI but from DeepMind. Hoffmann et al. (2022) published the Chinchilla paper, which demonstrated that most large language models were significantly undertrained relative to their size. The key finding was a scaling law relating optimal model size $N$ and training tokens $D$ for a fixed compute budget $C$:
$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$
This means that model size and training data should be scaled in roughly equal proportion. Prior to Chinchilla, the convention (following the Kaplan et al. scaling laws) favored larger models trained on relatively fewer tokens. The Chinchilla analysis showed that a 70B parameter model trained on 1.4 trillion tokens could outperform the 280B parameter Gopher model trained on 300 billion tokens, despite using the same compute budget.
The practical implication was immediate: the community shifted toward training smaller models on much more data. LLaMA (65B parameters, 1.4T tokens), Mistral 7B (trained on an undisclosed but large token count), and subsequent models all reflected this insight. For AI engineers, the Chinchilla result provides a concrete guideline: if you are planning a training run, allocate your compute budget by scaling data and parameters together, rather than investing disproportionately in either.
The core architectural pattern shared by GPT-1, GPT-2, and GPT-3 is a stack of causally masked Transformer decoder blocks applied to token and positional embeddings, followed by a language modeling head, trained with the next-token prediction objective. The key lever that changed across generations was scale---more parameters, more data, and more compute. The architecture itself changed remarkably little. As we will explore in Chapter 22, the scaling laws that govern this relationship are among the most important empirical findings in modern AI.
Let us now walk through each component of a decoder-only Transformer in precise detail, preparing for our PyTorch implementation.
The input to a decoder-only model is a sequence of token indices $(x_1, x_2, \ldots, x_T)$. Two embedding matrices are used: a token embedding matrix $W_e \in \mathbb{R}^{|V| \times d}$ and a learned positional embedding matrix $W_p \in \mathbb{R}^{T_{\max} \times d}$. The input representation is the sum:
$$h_0^{(t)} = W_e[x_t] + W_p[t]$$
Note that GPT models use learned positional embeddings, not the sinusoidal embeddings from the original Transformer paper. Learned embeddings give the model more flexibility to discover positional patterns from data.
Each attention head computes queries, keys, and values:
$$Q = h W^Q, \quad K = h W^K, \quad V = h W^V$$
where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$ and $d_k = d / n_{\text{heads}}$. The causal attention scores are:
$$A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$$
where $M$ is the causal mask. The output is $AV$, and the outputs from all heads are concatenated and projected:
$$\text{MultiHead}(h) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) W^O$$
The feed-forward network applies two linear transformations with a non-linearity:
$$\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2$$
where $W_1 \in \mathbb{R}^{d \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d}$. The inner dimension $d_{ff}$ is typically $4d$.
GPT models use the GELU (Gaussian Error Linear Unit) activation rather than ReLU:
$$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. Intuitively, GELU acts like a "smooth gate": for large positive inputs it behaves like the identity function, for large negative inputs it outputs approximately zero, and for inputs near zero it provides a smooth, stochastic-like transition. Compared to ReLU, GELU avoids the "dead neuron" problem (where neurons with negative pre-activation never recover) and provides non-zero gradients for slightly negative inputs. The choice of GELU in GPT models has become standard across most modern Transformer architectures.
Modern variants like LLaMA use the SwiGLU activation, which replaces the single linear-then-activation pattern with a gated linear unit:
$$\text{SwiGLU}(x) = \text{SiLU}(xW_1) \otimes xV$$
where $\text{SiLU}(x) = x \cdot \sigma(x)$ is the Sigmoid Linear Unit and $\otimes$ denotes element-wise multiplication. SwiGLU provides an additional learnable gating mechanism that empirically improves language modeling performance, though it requires a third weight matrix $V$, slightly increasing parameters per layer.
GPT-1 used post-norm (layer norm after the residual connection), following the original Transformer:
$$h' = \text{LayerNorm}(h + \text{Attention}(h))$$
GPT-2 and later models switched to pre-norm (layer norm before the sub-layer):
$$h' = h + \text{Attention}(\text{LayerNorm}(h))$$
Pre-norm is more stable during training because the residual stream is not normalized, allowing gradients to flow more freely through the residual connections.
After the final Transformer block, the hidden states are projected to logits over the vocabulary:
$$\text{logits} = h_L W_e^\top$$
Many GPT models tie the output projection weights with the input embedding matrix $W_e$, a technique called weight tying. This reduces the number of parameters and provides a regularization effect, since the model must use the same representation space for both input and output.
Now let us build a complete decoder-only language model from scratch. Our mini-GPT will be small enough to train on a single GPU but will contain all the essential components of the full GPT architecture. For our mini-GPT with $d = 384$, $L = 6$, $H = 6$, $V = 50257$, the total comes to approximately 30 million parameters---a manageable size for experimentation on a single GPU (see the breakdown in Section 21.6.5).
Once a language model is trained, we need a procedure for generating text. Given a prompt $(x_1, \ldots, x_k)$, we compute the distribution $P(x_{k+1} \mid x_1, \ldots, x_k)$ and select the next token. We then extend the prompt by one token and repeat. The question is: how should we select each token from the predicted distribution?
The simplest strategy is to always pick the most likely token:
$$x_{t+1} = \arg\max_{v} P(v \mid x_1, \ldots, x_t)$$
Advantages: Deterministic, fast, often produces grammatically correct text.
Disadvantages: Tends to produce repetitive, boring text. It always follows the single highest-probability path, which often leads to degenerate loops like "I think that I think that I think that...".
Before sampling, we can divide the logits by a temperature parameter $\tau > 0$:
$$P_\tau(v \mid \mathbf{x}_{\le t}) = \frac{\exp(z_v / \tau)}{\sum_{v'} \exp(z_{v'} / \tau)}$$
where $z_v$ are the raw logits.
The temperature controls the "sharpness" of the distribution: $\tau < 1$ sharpens it (concentrating probability on the most likely tokens), $\tau = 1$ leaves it unchanged, and $\tau > 1$ flattens it (spreading probability toward less likely tokens). Temperature does not change the ranking of tokens---the most likely token remains the most likely---but it changes the relative probabilities.
Top-k sampling (Fan et al., 2018) restricts sampling to the $k$ most likely tokens: the logits of all other tokens are set to $-\infty$ before renormalizing and sampling. This prevents the model from sampling very unlikely tokens that could derail generation, while still maintaining diversity among the most likely candidates.
Trade-off: Choosing $k$ is a fixed threshold that does not adapt to the distribution. When the model is confident (distribution is peaked), even $k = 10$ may include very unlikely tokens. When the model is uncertain (distribution is flat), $k = 10$ may exclude plausible continuations.
Top-p sampling, also called nucleus sampling (Holtzman et al., 2020), addresses the fixed-$k$ problem by dynamically choosing the set of tokens to sample from. Instead of a fixed number of tokens, it selects the smallest set of tokens whose cumulative probability exceeds a threshold $p$. For example, with $p = 0.9$, if the top five tokens have probabilities $(0.5, 0.3, 0.15, 0.03, 0.02)$, the nucleus is the first three tokens (cumulative probabilities $0.5$, $0.8$, $0.95$; the third is the first to exceed $0.9$).
Advantages: Adapts to the shape of the distribution. When the model is confident, the nucleus is small. When the model is uncertain, the nucleus is large.
To build intuition, consider a model that outputs the probability distribution shown in the table in Section 21.7.5 for the next token after the prompt "The cat sat on the". This example illustrates why practitioners typically combine temperature with nucleus sampling: temperature controls the overall randomness, while nucleus sampling provides a safety net against extremely unlikely tokens. In practice, these strategies are often combined, as shown in Section 21.7.6.
During autoregressive generation, each new token requires a full forward pass through the model. At step $t$, the model processes the sequence $(x_1, \ldots, x_t)$ to predict $x_{t+1}$. At step $t+1$, it processes $(x_1, \ldots, x_t, x_{t+1})$ to predict $x_{t+2}$. Notice that the keys and values for positions $1$ through $t$ are recomputed at every step, even though they never change (because causal masking ensures that earlier positions do not depend on later ones).
KV caching eliminates this redundancy by storing the key and value tensors from previous steps. At each generation step, only the new token's key and value are computed and appended to the cache. The attention computation then uses the full cached keys and values, but only computes a new query for the latest position.
Without a KV cache, generating token $t$ re-encodes the entire prefix, so the attention cost of that step scales as $O(t^2 \cdot d)$; generating $T$ tokens from a prompt of length $P$ therefore costs on the order of
$$\sum_{t=P}^{P+T} t^2 \cdot d = O\!\left(T \cdot (P + T)^2 \cdot d\right)$$
With a KV cache, each step computes a single new query that attends over $t$ cached keys and values, costing $O(t \cdot d)$ per step:
$$\sum_{t=P}^{P+T} t \cdot d = O\!\left(T \cdot (P + T) \cdot d\right)$$
Per generated token, the attention cost thus drops from quadratic to linear in the sequence length---a dramatic speedup for long sequences.
KV caching trades memory for speed. For a model with $L$ layers, $H$ heads, head dimension $d_k$, and a cache of length $S$:
$$\text{KV cache memory} = 2 \times L \times S \times H \times d_k \times \text{sizeof(dtype)}$$
For GPT-2 Large ($L=36$, $H=20$, $d_k=64$, $S=1024$, float16):
$$2 \times 36 \times 1024 \times 20 \times 64 \times 2 \text{ bytes} \approx 189 \text{ MB per sequence}$$
For larger models with longer context windows, the KV cache can consume gigabytes of memory, which is one of the bottlenecks in serving LLMs in production.
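A small helper makes this arithmetic easy to reuse; the sketch below simply encodes the formula above, with the GPT-2 Large figures from the text and a hypothetical batch size of 32 for illustration.

```python
def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: 2 (K and V) x layers x positions x heads x head_dim."""
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_elem

# GPT-2 Large in float16 at a 1024-token context, as in the example above.
per_seq = kv_cache_bytes(n_layers=36, n_heads=20, head_dim=64, seq_len=1024)
print(f"~{per_seq / 1e6:.0f} MB per sequence")            # ~189 MB
print(f"~{32 * per_seq / 1e9:.1f} GB for a batch of 32")  # ~6.0 GB
```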
Several techniques have been developed to address this bottleneck, including multi-query and grouped-query attention (which share keys and values across heads), cache quantization, and paged cache management in serving frameworks.
Let us now see how to use a pre-trained GPT-2 model from the HuggingFace transformers library.
Perplexity is the standard evaluation metric for language models:
$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{t=1}^{N} \log P(x_t \mid \mathbf{x}_{<t})\right)$$
A lower perplexity indicates that the model assigns higher probability to the text---in other words, the text is less "surprising" to the model.
Let us put our mini-GPT implementation to work by training it on a real text dataset.
Autoregressive models have a well-known tendency to produce repetitive text, especially with greedy or low-temperature decoding. This happens because once a phrase appears in the context, the model assigns higher probability to it appearing again (since repeated phrases are common in training data).
The repetition penalty (Keskar et al., 2019) discourages the model from producing tokens that have already appeared in the generated text:
$$\tilde{z}_v = \begin{cases} z_v / \alpha & \text{if } z_v > 0 \text{ and } v \in \text{generated tokens} \\ z_v \times \alpha & \text{if } z_v < 0 \text{ and } v \in \text{generated tokens} \\ z_v & \text{otherwise} \end{cases}$$
where $\alpha > 1$ is the penalty factor. This reduces the probability of tokens that have already been generated, without completely prohibiting them.
OpenAI's API popularized two related penalties: the frequency penalty, which grows with the number of times a token has already appeared, and the presence penalty, which applies a fixed penalty once a token has appeared at all.
While less common for open-ended generation, beam search is still used for tasks like machine translation. It maintains $B$ candidate sequences (beams) and at each step expands each beam with the top-$k$ tokens, keeping only the best $B$ overall (see the implementation in Section 21.11.4).
Research has shown that different attention heads in GPT-style models learn to specialize in different roles: for example, some heads attend mostly to the immediately preceding token, some track positional or syntactic structure, and some copy patterns that appeared earlier in the context. The triangular structure of the attention matrix is clearly visible: each row (query) only has non-zero attention weights for columns (keys) up to and including its own position.
The original GPT models use learned position embeddings: a separate embedding vector for each absolute position. This is simple and effective but limits the model to the maximum sequence length seen during training.
Modern decoder-only models (e.g., Llama, Mistral) often use Rotary Positional Embeddings (Su et al., 2021). RoPE encodes position information by rotating the query and key vectors:
$$\text{RoPE}(x_m, m) = R_m x_m$$
where $R_m$ is a rotation matrix that depends on position $m$. The key property is that the dot product $\langle R_m q, R_n k \rangle$ depends only on the relative position $m - n$, giving the model a natural sense of distance that generalizes to longer sequences.
ALiBi (Press et al., 2022) takes a different approach: instead of adding positional information to the embeddings, it adds a linear bias to the attention scores that penalizes distant positions:
$$\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} - m \cdot |i - j|\right)$$
where $m$ is a head-specific slope. This approach allows models to extrapolate to longer sequences than they were trained on.
Meta's Llama models (Touvron et al., 2023) are decoder-only Transformers that incorporate several modern improvements, including RoPE positional embeddings, the SwiGLU activation, RMSNorm in place of LayerNorm, and (in later versions) grouped-query attention.
Mistral AI's models (Jiang et al., 2023) introduced several architectural innovations aimed at improving efficiency without sacrificing quality, most notably sliding-window attention and, in Mixtral, a sparse mixture-of-experts layer. Mistral 7B outperformed the much larger Llama 2 13B on most benchmarks, demonstrating that architectural innovation and data quality can compensate for raw parameter count.
The Technology Innovation Institute's Falcon models (Penedo et al., 2023) demonstrated the importance of data curation. Falcon was trained on the RefinedWeb dataset, a carefully filtered and deduplicated subset of Common Crawl. The key insight was that web data, when properly curated, can match or exceed the quality of manually assembled datasets. Falcon 40B achieved state-of-the-art results among open models at the time of release, with its success attributed primarily to data quality rather than architectural novelty.
Microsoft's Phi series (Gunasekar et al., 2023) challenged the assumption that scale is necessary for strong performance. Phi-1 (1.3B parameters) achieved remarkable coding ability by training on carefully selected "textbook-quality" data. The subsequent Phi-2 (2.7B) and Phi-3 (3.8B) models extended this philosophy to general language understanding. The Phi results suggest that data quality can substitute for model size to a surprising degree, an insight that is especially important for deployment scenarios where inference cost must be minimized.
The open-source ecosystem has evolved rapidly, and the choice of base model depends on the specific use case: Phi for latency-sensitive deployment, Llama 3 for general-purpose applications, Mixtral for throughput-optimized serving, and specialized models for domain-specific tasks. We will explore how to fine-tune these models in Chapter 24 and align them in Chapter 25.
Autoregressive generation is inherently sequential: each token must be generated before the next can begin. For large models, each forward pass is expensive, and the total generation time scales linearly with the number of output tokens. This makes generation the primary bottleneck in production LLM systems, especially for long outputs.
Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) accelerates generation by using a smaller, faster draft model to propose multiple tokens at once, then verifying those tokens in parallel with the larger target model. The key insight is that verification is much cheaper than generation: the target model can check $K$ proposed tokens in a single forward pass (the same cost as generating one token), whereas generating those tokens one at a time would require $K$ forward passes.
The algorithm works as follows: the draft model autoregressively proposes $K$ tokens; the target model scores the prompt plus all $K$ draft tokens in a single forward pass; the draft tokens are then accepted or rejected left to right, and at the first rejection a corrected token is sampled and the remaining drafts are discarded. For each draft token $\tilde{y}_t$, the acceptance probability is:
$$\alpha_t = \min\left(1, \frac{p(\tilde{y}_t \mid y_{<t})}{q(\tilde{y}_t \mid y_{<t})}\right)$$
where $p$ is the target model's distribution and $q$ is the draft model's. If $\tilde{y}_t$ is rejected (with probability $1 - \alpha_t$), a corrected token is sampled from:
$$y_t \sim \text{norm}\left(\max\left(0, \; p(\cdot \mid y_{<t}) - q(\cdot \mid y_{<t})\right)\right)$$
This rejection sampling scheme guarantees that the output distribution is identical to sampling from the target model alone---speculative decoding provides a speedup with no quality degradation.
The speedup depends on how well the draft model approximates the target model. If the average acceptance rate per token is $\alpha$, the expected number of tokens accepted per verification step is:
$$\mathbb{E}[\text{accepted tokens}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$
In practice, with a well-chosen draft model, speculative decoding achieves 2-3x speedup. The draft model is typically a smaller model from the same family (e.g., Llama-7B drafting for Llama-70B) or a purpose-trained lightweight model.
Loss spikes: Sudden increases in loss during training. Common causes include learning rate too high, insufficient warmup, or numerical overflow in attention. Solutions: gradient clipping, learning rate warmup, mixed-precision training with loss scaling.
Degenerate repetition during generation: If the model generates the same token or phrase repeatedly, try:
- Increasing temperature.
- Using top-p sampling instead of greedy decoding.
- Adding a repetition penalty.
Context length mismatch: If you train with a block size of 256 but try to generate with a context longer than 256 tokens, the model will either crash or produce garbage. Always respect the training context length.
Forgetting the causal mask: Without the mask, the model can "see the future" during training and learns to cheat. The training loss will drop rapidly but the model will not generate coherent text.
Off-by-one errors in targets: The target sequence should be the input sequence shifted by one position. Make sure targets[t] = input[t + 1].
Not applying the mask correctly during cached generation: When using KV cache, the mask should only apply to the new tokens, not the cached keys.
Flash Attention: Use torch.nn.functional.scaled_dot_product_attention with is_causal=True for optimized attention that fuses the mask, softmax, and matrix multiplication into a single kernel.
Mixed precision: Use torch.cuda.amp for automatic mixed-precision training, which can provide a 2-3x speedup.
Gradient accumulation: If your GPU cannot fit large batch sizes, accumulate gradients over multiple mini-batches.
Beyond the basic generation examples shown in Section 21.9, production systems often require more nuanced control over the generation process. Here we present several common patterns. For interactive applications, streaming delivers tokens to the user as they are generated, reducing perceived latency (Section 21.17.1). When processing many prompts, batching amortizes the cost of model loading and memory transfers (Section 21.17.2). Production systems often need generation to stop at specific tokens or patterns (Section 21.17.3). These patterns form the building blocks of production LLM applications. As we will explore in Chapter 23, the prompts fed to these generation pipelines are themselves an engineering discipline.
The autoregressive language model is the foundation, but the models people interact with (ChatGPT, Claude) undergo additional training stages: supervised fine-tuning on instruction-response data, followed by alignment methods such as reinforcement learning from human feedback (RLHF). We will explore these alignment techniques in Chapter 25. The key insight is that the base autoregressive model provides the "knowledge" and "capability," while alignment training shapes the model's behavior to be helpful, harmless, and honest.
In this chapter, we have explored the decoder-only Transformer architecture and the autoregressive language modeling paradigm:
- Autoregressive factorization decomposes a sequence probability into a product of conditional probabilities, each predicting the next token given all previous tokens.
- Causal masking enforces the autoregressive property during training by preventing each position from attending to future positions, while still allowing parallel computation of all predictions within a sequence.
- The GPT family (GPT-1, GPT-2, GPT-3) demonstrated that scaling a simple decoder-only architecture with a next-token prediction objective produces increasingly capable language models, culminating in the emergence of in-context learning.
- Text generation strategies including greedy decoding, temperature scaling, top-k sampling, and nucleus (top-p) sampling provide different trade-offs between quality and diversity. In practice, these are often combined.
- KV caching is essential for efficient autoregressive generation, eliminating redundant computation of keys and values for previously processed tokens.
- Modern decoder-only models (Llama, Mistral, etc.) incorporate improvements like RoPE, SwiGLU, RMSNorm, and grouped query attention, but the fundamental architecture remains a causally-masked Transformer decoder.
- Speculative decoding accelerates inference by using a fast draft model to propose multiple tokens, which are then verified in parallel by the target model, achieving 2-3x speedup with no quality loss.
- The open-source ecosystem has matured rapidly, offering a range of models from compact (Phi, 1-3B) to frontier-scale (Llama 3, 405B), each with distinct strengths for different deployment scenarios.
The decoder-only autoregressive model is arguably the most important architecture in modern AI. Its simplicity---predict the next token---belies the extraordinary depth of capability that emerges at scale. In the next chapter, we will examine how these models scale, both the theoretical scaling laws that govern their behavior and the practical considerations of training models with hundreds of billions of parameters.
21.2.2 Why Autoregressive?
21.2.3 The Next-Token Prediction Objective
21.3 Causal (Left-to-Right) Masking
21.3.1 The Problem with Bidirectional Attention
21.3.2 The Causal Mask
21.3.3 Visualizing the Causal Mask
21.3.4 Implementation Detail: Creating the Mask in PyTorch
import torch
def create_causal_mask(seq_len: int) -> torch.Tensor:
"""Create a causal (look-ahead) mask for autoregressive attention.
Args:
seq_len: Length of the input sequence.
Returns:
A boolean mask of shape (seq_len, seq_len) where True
indicates positions that should be masked (set to -inf).
"""
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
return mask
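As a quick check, the following standalone snippet applies the mask returned by create_causal_mask to a random score matrix and confirms the triangular pattern described above (the tensors are toy examples, not real attention scores).

```python
import torch
import torch.nn.functional as F

T = 5
scores = torch.randn(T, T)                      # raw attention scores e_ij for one head
mask = create_causal_mask(T)                    # True above the diagonal (j > i)
weights = F.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)

print(weights)              # lower-triangular: row i has non-zero weight only for j <= i
print(weights.sum(dim=-1))  # each row still sums to 1
```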
21.3.5 Parallel Training Despite Sequential Generation
21.4 The GPT Architecture Family
21.4.1 GPT-1: Generative Pre-training (2018)
21.4.2 GPT-2: Language Models Are Unsupervised Multitask Learners (2019)
| Feature | GPT-1 | GPT-2 |
| --- | --- | --- |
| Parameters | 117M | 1.5B (largest) |
| Layers | 12 | 48 |
| Hidden dim | 768 | 1600 |
| Context length | 512 | 1024 |
| Training data | BooksCorpus | WebText (40 GB) |
| Vocabulary | ~40K BPE | ~50K BPE |
21.4.3 GPT-3: Language Models Are Few-Shot Learners (2020)
21.4.4 GPT-4 and the Frontier (2023)
21.4.5 The Chinchilla Scaling Insight
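As a rough illustration of the compute-optimal trade-off, here is a back-of-the-envelope sketch. It assumes the common approximation that training cost is $C \approx 6ND$ FLOPs and the roughly 20-tokens-per-parameter heuristic often quoted alongside Chinchilla; the constants are illustrative, not the paper's fitted coefficients.

```python
import math

def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a FLOP budget between parameters N and tokens D.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, so that N and D
    both scale as C**0.5, matching the proportionality in the text.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of ~5.9e23 FLOPs recovers roughly the Chinchilla configuration.
n, d = compute_optimal_split(5.9e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")   # ~70B and ~1.4T
```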
21.4.6 Summary of Architectural Choices
21.5 Decoder-Only Architecture in Detail
21.5.1 Token and Positional Embeddings
21.5.2 Masked Multi-Head Self-Attention
21.5.3 Position-Wise Feed-Forward Network
21.5.4 Pre-Norm vs. Post-Norm
21.5.5 The Language Modeling Head
21.6 Implementing a Mini-GPT in PyTorch
21.6.1 Configuration
from dataclasses import dataclass
@dataclass
class GPTConfig:
"""Configuration for our mini-GPT model.
Attributes:
vocab_size: Size of the token vocabulary.
block_size: Maximum sequence length (context window).
n_layer: Number of Transformer decoder layers.
n_head: Number of attention heads.
n_embd: Embedding dimension.
dropout: Dropout probability.
"""
vocab_size: int = 50257 # GPT-2 vocabulary size
block_size: int = 256
n_layer: int = 6
n_head: int = 6
n_embd: int = 384
dropout: float = 0.1
21.6.2 Causal Self-Attention
import torch
import torch.nn as nn
import torch.nn.functional as F
class CausalSelfAttention(nn.Module):
"""Multi-head causal self-attention for autoregressive models.
Implements masked multi-head attention where each position can
only attend to itself and previous positions.
Args:
config: GPTConfig with model hyperparameters.
"""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
assert config.n_embd % config.n_head == 0
# Key, query, value projections combined
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
# Output projection
self.c_proj = nn.Linear(config.n_embd, config.n_embd)
self.attn_dropout = nn.Dropout(config.dropout)
self.resid_dropout = nn.Dropout(config.dropout)
self.n_head = config.n_head
self.n_embd = config.n_embd
# Causal mask: register as buffer (not a parameter)
self.register_buffer(
"bias",
torch.tril(torch.ones(config.block_size, config.block_size))
.view(1, 1, config.block_size, config.block_size)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass through causal self-attention.
Args:
x: Input tensor of shape (batch, seq_len, n_embd).
Returns:
Output tensor of shape (batch, seq_len, n_embd).
"""
B, T, C = x.size()
# Compute Q, K, V for all heads in batch
qkv = self.c_attn(x)
q, k, v = qkv.split(self.n_embd, dim=2)
# Reshape for multi-head attention
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
# Causal self-attention: (B, n_head, T, head_dim) x (B, n_head, head_dim, T) -> (B, n_head, T, T)
att = (q @ k.transpose(-2, -1)) * (1.0 / (k.size(-1) ** 0.5))
att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
att = self.attn_dropout(att)
y = att @ v # (B, n_head, T, head_dim)
# Reassemble all head outputs
y = y.transpose(1, 2).contiguous().view(B, T, C)
# Output projection
y = self.resid_dropout(self.c_proj(y))
return y
21.6.3 Feed-Forward Network and Transformer Block
class FeedForward(nn.Module):
"""Position-wise feed-forward network with GELU activation.
Args:
config: GPTConfig with model hyperparameters.
"""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
self.dropout = nn.Dropout(config.dropout)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass through the feed-forward network.
Args:
x: Input tensor of shape (batch, seq_len, n_embd).
Returns:
Output tensor of shape (batch, seq_len, n_embd).
"""
x = self.c_fc(x)
x = F.gelu(x)
x = self.c_proj(x)
x = self.dropout(x)
return x
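For comparison with the SwiGLU variant described in Section 21.5.3, here is a sketch of a gated feed-forward block in the same style. It is an illustrative alternative, not part of the GPT-2 reference implementation, and it reuses the GPTConfig defined in Section 21.6.1; Llama-style models also shrink the hidden width to keep the parameter count comparable.

```python
class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: FFN(x) = (SiLU(x W1) * x V) W2.

    A sketch of the SwiGLU variant from Section 21.5.3; compared with the GELU
    block above, it adds a third weight matrix that acts as a gate.

    Args:
        config: GPTConfig with model hyperparameters.
    """
    def __init__(self, config: GPTConfig) -> None:
        super().__init__()
        hidden = 4 * config.n_embd
        self.w1 = nn.Linear(config.n_embd, hidden, bias=False)  # gated branch
        self.v = nn.Linear(config.n_embd, hidden, bias=False)   # linear branch
        self.w2 = nn.Linear(hidden, config.n_embd, bias=False)  # output projection
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dropout(self.w2(F.silu(self.w1(x)) * self.v(x)))
```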
class TransformerBlock(nn.Module):
"""A single Transformer decoder block with pre-norm architecture.
Uses the pre-norm variant (GPT-2 style) where layer normalization
is applied before the attention and feed-forward sub-layers.
Args:
config: GPTConfig with model hyperparameters.
"""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
self.ln_1 = nn.LayerNorm(config.n_embd)
self.attn = CausalSelfAttention(config)
self.ln_2 = nn.LayerNorm(config.n_embd)
self.mlp = FeedForward(config)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass through the Transformer block.
Args:
x: Input tensor of shape (batch, seq_len, n_embd).
Returns:
Output tensor of shape (batch, seq_len, n_embd).
"""
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
return x
21.6.4 The Complete Mini-GPT Model
class MiniGPT(nn.Module):
"""A minimal GPT-style decoder-only language model.
Implements the core GPT architecture with token embeddings,
learned positional embeddings, stacked Transformer decoder blocks,
and a language modeling head with weight tying.
Args:
config: GPTConfig with model hyperparameters.
"""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
self.config = config
self.transformer = nn.ModuleDict(dict(
wte=nn.Embedding(config.vocab_size, config.n_embd),
wpe=nn.Embedding(config.block_size, config.n_embd),
drop=nn.Dropout(config.dropout),
h=nn.ModuleList([
TransformerBlock(config) for _ in range(config.n_layer)
]),
ln_f=nn.LayerNorm(config.n_embd),
))
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
# Weight tying
self.transformer.wte.weight = self.lm_head.weight
# Initialize weights
self.apply(self._init_weights)
def _init_weights(self, module: nn.Module) -> None:
"""Initialize weights using normal distribution.
Args:
module: The module whose weights to initialize.
"""
if isinstance(module, nn.Linear):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
def forward(
self,
idx: torch.Tensor,
targets: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor | None]:
"""Forward pass through the mini-GPT model.
Args:
idx: Input token indices of shape (batch, seq_len).
targets: Target token indices of shape (batch, seq_len).
If provided, the cross-entropy loss is also returned.
Returns:
A tuple of (logits, loss) where logits has shape
(batch, seq_len, vocab_size) and loss is a scalar
tensor (or None if targets is not provided).
"""
B, T = idx.size()
assert T <= self.config.block_size, (
f"Sequence length {T} exceeds block size {self.config.block_size}"
)
# Token + positional embeddings
pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
tok_emb = self.transformer.wte(idx) # (B, T, n_embd)
pos_emb = self.transformer.wpe(pos) # (T, n_embd)
x = self.transformer.drop(tok_emb + pos_emb)
# Transformer blocks
for block in self.transformer.h:
x = block(x)
# Final layer norm + language modeling head
x = self.transformer.ln_f(x)
logits = self.lm_head(x) # (B, T, vocab_size)
# Compute loss if targets provided
loss = None
if targets is not None:
loss = F.cross_entropy(
logits.view(-1, logits.size(-1)),
targets.view(-1),
)
return logits, loss
def count_parameters(self) -> int:
"""Count the total number of trainable parameters.
Returns:
Total number of trainable parameters.
"""
return sum(p.numel() for p in self.parameters() if p.requires_grad)
21.6.5 Model Size Calculation
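The ~30M figure quoted in the text can be verified directly with the classes defined above; the hand arithmetic in this sketch assumes the default GPTConfig values and the tied LM head.

```python
config = GPTConfig()   # d=384, 6 layers, 6 heads, vocab 50257, block size 256
model = MiniGPT(config)
print(f"{model.count_parameters() / 1e6:.1f}M parameters")   # ~30.0M

# Rough by-hand breakdown (weights and biases; lm_head shares the token embedding):
d, L, V, T, d_ff = 384, 6, 50257, 256, 4 * 384
emb = V * d + T * d                               # token + positional embeddings (~19.4M)
attn = (d * 3 * d + 3 * d) + (d * d + d)          # qkv projection + output projection
ffn = (d * d_ff + d_ff) + (d_ff * d + d)          # two linear layers
block = attn + ffn + 2 * 2 * d                    # plus two LayerNorms (~1.77M per block)
print(f"{(emb + L * block + 2 * d) / 1e6:.1f}M by hand")      # ~30.0M
```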
21.7 Text Generation Strategies
21.7.1 Greedy Decoding
21.7.2 Temperature Scaling
21.7.3 Top-k Sampling
21.7.4 Top-p (Nucleus) Sampling
21.7.5 A Worked Example: Comparing Sampling Strategies
| Token | Probability |
| --- | --- |
| mat | 0.35 |
| floor | 0.25 |
| bed | 0.15 |
| roof | 0.10 |
| chair | 0.05 |
| table | 0.04 |
| ground | 0.03 |
| other | 0.03 |
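Using the numbers in this table, a short sketch makes the effect of temperature and nucleus truncation concrete; the probabilities are copied from the table, and the approximate values in the comments were computed from them.

```python
import torch
import torch.nn.functional as F

tokens = ["mat", "floor", "bed", "roof", "chair", "table", "ground", "other"]
probs = torch.tensor([0.35, 0.25, 0.15, 0.10, 0.05, 0.04, 0.03, 0.03])
logits = probs.log()   # logits recovered up to an additive constant

# Temperature: tau < 1 sharpens the distribution, tau > 1 flattens it.
for tau in (0.5, 1.0, 2.0):
    p = F.softmax(logits / tau, dim=-1)
    print(f"tau={tau}: P(mat) = {p[0]:.2f}")   # ~0.57, 0.35, 0.23

# Nucleus (top-p) with p = 0.92: keep tokens until the running mass reaches p.
cumulative = probs.cumsum(dim=0)
keep = (cumulative - probs) < 0.92
print([t for t, k in zip(tokens, keep) if k])   # ['mat', 'floor', 'bed', 'roof', 'chair', 'table']
```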
21.7.6 Combining Strategies
# Common combination: temperature + top-p
output = model.generate(
input_ids,
temperature=0.8, # Slightly sharpen the distribution
top_p=0.95, # Nucleus sampling
top_k=50, # Additional safety net
max_new_tokens=100,
)
21.7.7 Implementation of Generation Strategies
@torch.no_grad()
def generate(
model: MiniGPT,
idx: torch.Tensor,
max_new_tokens: int,
temperature: float = 1.0,
top_k: int | None = None,
top_p: float | None = None,
) -> torch.Tensor:
"""Generate text autoregressively from the model.
Args:
model: A trained MiniGPT model.
idx: Initial token indices of shape (batch, seq_len).
max_new_tokens: Number of new tokens to generate.
temperature: Temperature for scaling logits.
top_k: If set, restrict sampling to top-k tokens.
top_p: If set, use nucleus sampling with this threshold.
Returns:
Token indices of shape (batch, seq_len + max_new_tokens).
"""
model.eval()
for _ in range(max_new_tokens):
# Crop to block size if necessary
idx_cond = idx[:, -model.config.block_size:]
# Forward pass
logits, _ = model(idx_cond)
logits = logits[:, -1, :] / temperature # (B, vocab_size)
# Top-k filtering
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = float('-inf')
# Top-p (nucleus) filtering
if top_p is not None:
sorted_logits, sorted_indices = torch.sort(
logits, descending=True
)
cumulative_probs = torch.cumsum(
F.softmax(sorted_logits, dim=-1), dim=-1
)
# Remove tokens with cumulative probability above threshold
sorted_indices_to_remove = cumulative_probs > top_p
# Shift so the first token above threshold is kept
sorted_indices_to_remove[..., 1:] = (
sorted_indices_to_remove[..., :-1].clone()
)
sorted_indices_to_remove[..., 0] = 0
indices_to_remove = sorted_indices_to_remove.scatter(
1, sorted_indices, sorted_indices_to_remove
)
logits[indices_to_remove] = float('-inf')
# Sample from the distribution
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
# Append to sequence
idx = torch.cat((idx, idx_next), dim=1)
return idx
21.8 KV Caching for Efficient Generation
21.8.1 The Redundancy Problem
21.8.2 The KV Cache Solution
21.8.3 KV Cache Implementation
class CausalSelfAttentionWithCache(nn.Module):
"""Causal self-attention with KV cache support.
Args:
config: GPTConfig with model hyperparameters.
"""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
assert config.n_embd % config.n_head == 0
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
self.c_proj = nn.Linear(config.n_embd, config.n_embd)
self.n_head = config.n_head
self.n_embd = config.n_embd
self.head_dim = config.n_embd // config.n_head
def forward(
self,
x: torch.Tensor,
kv_cache: tuple[torch.Tensor, torch.Tensor] | None = None,
) -> tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
"""Forward pass with optional KV cache.
Args:
x: Input tensor of shape (batch, seq_len, n_embd).
During cached generation, seq_len is typically 1.
kv_cache: Optional tuple of (cached_keys, cached_values),
each of shape (batch, n_head, cache_len, head_dim).
Returns:
A tuple of (output, new_kv_cache) where output has shape
(batch, seq_len, n_embd) and new_kv_cache is a tuple
of updated (keys, values) tensors.
"""
B, T, C = x.size()
qkv = self.c_attn(x)
q, k, v = qkv.split(self.n_embd, dim=2)
q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
# Append to cache if it exists
if kv_cache is not None:
k_cache, v_cache = kv_cache
k = torch.cat([k_cache, k], dim=2)
v = torch.cat([v_cache, v], dim=2)
new_kv_cache = (k, v)
# Compute attention
att = (q @ k.transpose(-2, -1)) * (1.0 / (self.head_dim ** 0.5))
# Apply causal mask only for the new positions
# When using cache, we only need to mask within the new tokens
if kv_cache is None and T > 1:
mask = torch.triu(
torch.ones(T, T, device=x.device), diagonal=1
).bool()
att = att.masked_fill(
mask.unsqueeze(0).unsqueeze(0), float('-inf')
)
att = F.softmax(att, dim=-1)
y = att @ v
y = y.transpose(1, 2).contiguous().view(B, T, C)
y = self.c_proj(y)
return y, new_kv_cache
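A minimal usage sketch of the prefill-then-decode pattern follows; the shapes are hypothetical, and the module above is applied directly to random embeddings rather than inside a full model.

```python
import torch

config = GPTConfig(block_size=64, n_layer=1, n_head=4, n_embd=64)
attn = CausalSelfAttentionWithCache(config)

prompt = torch.randn(1, 10, config.n_embd)      # embeddings for a 10-token prompt
out, cache = attn(prompt, kv_cache=None)        # prefill: cache now holds 10 keys/values

for _ in range(5):                              # decode 5 tokens, one at a time
    new_tok = torch.randn(1, 1, config.n_embd)  # embedding of the latest token only
    out, cache = attn(new_tok, kv_cache=cache)  # single query attends over all cached K/V

print(cache[0].shape)   # torch.Size([1, 4, 15, 16]): 10 prompt + 5 generated positions
```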
21.8.4 Memory Trade-off
21.9 Using GPT-2 with HuggingFace Transformers
21.9.1 Loading the Model and Tokenizer
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Load pre-trained GPT-2 (124M parameters)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
# Tokenize a prompt
prompt = "The future of artificial intelligence"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
print(f"Prompt: {prompt}")
print(f"Token IDs: {input_ids}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(input_ids[0])}")
21.9.2 Generation with Different Strategies
import torch
torch.manual_seed(42)
# Greedy decoding
greedy_output = model.generate(
input_ids,
max_new_tokens=50,
do_sample=False,
)
print("Greedy:", tokenizer.decode(greedy_output[0], skip_special_tokens=True))
# Temperature sampling
temp_output = model.generate(
input_ids,
max_new_tokens=50,
do_sample=True,
temperature=0.7,
)
print("Temperature 0.7:", tokenizer.decode(temp_output[0], skip_special_tokens=True))
# Top-k sampling
topk_output = model.generate(
input_ids,
max_new_tokens=50,
do_sample=True,
top_k=50,
)
print("Top-k (k=50):", tokenizer.decode(topk_output[0], skip_special_tokens=True))
# Nucleus sampling
nucleus_output = model.generate(
input_ids,
max_new_tokens=50,
do_sample=True,
top_p=0.92,
)
print("Nucleus (p=0.92):", tokenizer.decode(nucleus_output[0], skip_special_tokens=True))
21.9.3 Analyzing Token Probabilities
import torch
import torch.nn.functional as F
torch.manual_seed(42)
with torch.no_grad():
outputs = model(input_ids)
logits = outputs.logits # (1, seq_len, vocab_size)
# Get probabilities for the next token after the prompt
next_token_logits = logits[0, -1, :]
probs = F.softmax(next_token_logits, dim=-1)
# Top 10 most likely next tokens
top_probs, top_indices = torch.topk(probs, 10)
print("Top 10 next-token predictions:")
for i in range(10):
token = tokenizer.decode(top_indices[i])
print(f" {token!r}: {top_probs[i]:.4f}")
21.9.4 Computing Perplexity
def compute_perplexity(
model: GPT2LMHeadModel,
tokenizer: GPT2Tokenizer,
text: str,
) -> float:
"""Compute the perplexity of a text under the model.
Args:
model: A GPT-2 language model.
tokenizer: The corresponding tokenizer.
text: The text to evaluate.
Returns:
The perplexity of the text.
"""
input_ids = tokenizer.encode(text, return_tensors="pt")
with torch.no_grad():
outputs = model(input_ids, labels=input_ids)
loss = outputs.loss # Cross-entropy loss
return torch.exp(loss).item()
text = "The quick brown fox jumps over the lazy dog."
ppl = compute_perplexity(model, tokenizer, text)
print(f"Perplexity: {ppl:.2f}")
21.10 Training a Mini-GPT on Text Data
21.10.1 Data Preparation
import torch
from torch.utils.data import Dataset, DataLoader
class TextDataset(Dataset):
"""A simple character-level or token-level text dataset.
Args:
text: The raw text data.
tokenizer: A tokenizer (or None for character-level).
block_size: The context window size.
"""
def __init__(
self,
text: str,
tokenizer: GPT2Tokenizer | None,
block_size: int,
) -> None:
if tokenizer is not None:
self.data = torch.tensor(
tokenizer.encode(text), dtype=torch.long
)
else:
# Character-level fallback
chars = sorted(list(set(text)))
self.stoi = {ch: i for i, ch in enumerate(chars)}
self.itos = {i: ch for i, ch in enumerate(chars)}
self.data = torch.tensor(
[self.stoi[c] for c in text], dtype=torch.long
)
self.block_size = block_size
def __len__(self) -> int:
return len(self.data) - self.block_size
def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
x = self.data[idx: idx + self.block_size]
y = self.data[idx + 1: idx + self.block_size + 1]
return x, y
21.10.2 Training Loop
def train_mini_gpt(
model: MiniGPT,
dataset: TextDataset,
epochs: int = 5,
batch_size: int = 32,
learning_rate: float = 3e-4,
device: str = "cuda" if torch.cuda.is_available() else "cpu",
) -> list[float]:
"""Train the mini-GPT model on a text dataset.
Args:
model: The MiniGPT model to train.
dataset: The training dataset.
epochs: Number of training epochs.
batch_size: Batch size.
learning_rate: Learning rate for the optimizer.
device: Device to train on.
Returns:
A list of average losses per epoch.
"""
torch.manual_seed(42)
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
dataloader = DataLoader(
dataset, batch_size=batch_size, shuffle=True, drop_last=True
)
epoch_losses = []
for epoch in range(epochs):
model.train()
total_loss = 0.0
num_batches = 0
for x, y in dataloader:
x, y = x.to(device), y.to(device)
logits, loss = model(x, targets=y)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
num_batches += 1
avg_loss = total_loss / num_batches
epoch_losses.append(avg_loss)
print(f"Epoch {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}")
return epoch_losses
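Putting the pieces together, a short driver script might look like the following sketch; "data/tiny.txt" is a hypothetical path to any plain-text file, and the reduced configuration is chosen only to keep the run small.

```python
from transformers import GPT2Tokenizer

with open("data/tiny.txt", encoding="utf-8") as f:   # hypothetical local text file
    text = f.read()

config = GPTConfig(block_size=128, n_layer=4, n_head=4, n_embd=256)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
dataset = TextDataset(text, tokenizer=tokenizer, block_size=config.block_size)

model = MiniGPT(config)
losses = train_mini_gpt(model, dataset, epochs=2, batch_size=16)
print(f"final epoch loss: {losses[-1]:.3f}")
```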
21.11 Repetition Penalties and Advanced Decoding
21.11.1 The Repetition Problem
21.11.2 Repetition Penalty
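A minimal sketch of the penalty rule given in the text, applied to a single position's logits; the numbers are illustrative only.

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated_ids: torch.Tensor,
                             penalty: float = 1.2) -> torch.Tensor:
    """Penalize tokens already present in generated_ids, following the rule above:
    positive logits are divided by the penalty, negative logits are multiplied by it.
    Expects 1-D logits for a single position."""
    logits = logits.clone()
    prev = torch.unique(generated_ids)
    scores = logits[prev]
    logits[prev] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits

logits = torch.tensor([2.0, -1.0, 0.5, 3.0])
print(apply_repetition_penalty(logits, torch.tensor([0, 1])))
# tensor([ 1.6667, -1.2000,  0.5000,  3.0000])
```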
21.11.3 Frequency and Presence Penalties
21.11.4 Beam Search
def beam_search(
model: MiniGPT,
idx: torch.Tensor,
max_new_tokens: int,
beam_width: int = 5,
) -> torch.Tensor:
"""Generate text using beam search.
Args:
model: A trained MiniGPT model.
idx: Initial token indices of shape (1, seq_len).
max_new_tokens: Number of new tokens to generate.
beam_width: Number of beams to maintain.
Returns:
The best sequence of token indices.
"""
model.eval()
device = idx.device
# Initialize beams: (sequence, cumulative_log_prob)
beams = [(idx, 0.0)]
for _ in range(max_new_tokens):
all_candidates = []
for seq, score in beams:
seq_cond = seq[:, -model.config.block_size:]
logits, _ = model(seq_cond)
log_probs = F.log_softmax(logits[:, -1, :], dim=-1)
# Get top-k continuations for this beam
top_log_probs, top_indices = torch.topk(
log_probs, beam_width, dim=-1
)
for i in range(beam_width):
new_seq = torch.cat(
[seq, top_indices[:, i:i + 1]], dim=1
)
new_score = score + top_log_probs[0, i].item()
all_candidates.append((new_seq, new_score))
# Keep the top beam_width candidates
beams = sorted(
all_candidates, key=lambda x: x[1], reverse=True
)[:beam_width]
# Return the best beam
return beams[0][0]
21.12 Understanding Attention Patterns in Decoder Models
21.12.1 Attention Head Specialization
21.12.2 Visualizing Attention
def visualize_attention(
model: GPT2LMHeadModel,
tokenizer: GPT2Tokenizer,
text: str,
layer: int = 0,
head: int = 0,
) -> None:
"""Extract and display attention weights from a GPT-2 model.
Args:
model: A GPT-2 model with output_attentions enabled.
tokenizer: The GPT-2 tokenizer.
text: Input text to analyze.
layer: Which layer's attention to visualize.
head: Which attention head to visualize.
"""
import matplotlib.pyplot as plt
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs, output_attentions=True)
# Extract attention weights: (n_layers, batch, n_heads, seq_len, seq_len)
attention = outputs.attentions[layer][0, head].numpy()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(attention, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=90)
ax.set_yticklabels(tokens)
ax.set_xlabel("Key")
ax.set_ylabel("Query")
ax.set_title(f"Layer {layer}, Head {head}")
plt.colorbar(im)
plt.tight_layout()
plt.savefig("attention_pattern.png", dpi=150)
plt.show()
21.13 Positional Encoding Variants
21.13.1 Learned Absolute Positions (GPT-1/2/3)
21.13.2 Rotary Positional Embeddings (RoPE)
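The relative-position property can be checked with a small sketch. This is one common "rotate-half" formulation written for clarity rather than efficiency; real implementations apply the rotation to queries and keys inside each attention head.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to x of shape (seq_len, head_dim).

    Channel i is paired with channel i + head_dim/2 and rotated by m * theta_i,
    where m is the position and theta_i = base ** (-2i / head_dim).
    """
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * theta   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Identical query/key content placed at different positions: the rotated dot product
# depends only on the offset m - n, so both pairs below give the same score.
q = torch.randn(1, 8).repeat(6, 1)
k = torch.randn(1, 8).repeat(6, 1)
q_r, k_r = rope(q), rope(k)
print(q_r[4] @ k_r[2], q_r[3] @ k_r[1])   # equal (offset of 2 in both cases)
```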
21.13.3 ALiBi (Attention with Linear Biases)
21.14 Decoder-Only Models Beyond GPT
21.14.1 Llama and the Open-Source Revolution
21.14.2 Mistral and the Efficient Frontier
21.14.3 Falcon and Data-Centric Scaling
21.14.4 Phi: The Small Model Revolution
21.14.5 The Open-Source Landscape at a Glance
| Model | Organization | Parameters | Key Innovation |
| --- | --- | --- | --- |
| Chinchilla | DeepMind | 70B | Compute-optimal training (as we saw in Section 21.4.5) |
| PaLM / PaLM 2 | Google | 540B / undisclosed | Pathways system, multilingual strength |
| Falcon | TII | 7B / 40B / 180B | Curated web data (RefinedWeb) |
| Mistral / Mixtral | Mistral AI | 7B / 8x7B | Sliding window attention, MoE |
| Phi-1/2/3 | Microsoft | 1.3B--14B | High-quality small-data training |
| Llama 1/2/3 | Meta | 7B--405B | RoPE, GQA, SwiGLU, open weights |
| Gemma | Google DeepMind | 2B / 7B / 27B | Efficient training, multi-query attention |
| Qwen | Alibaba | 0.5B--72B | Strong multilingual, coding ability |
21.15 Speculative Decoding
21.15.1 The Inference Bottleneck
21.15.2 The Speculative Decoding Idea
21.15.3 Acceptance Criterion
21.15.4 Expected Speedup
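Plugging a few acceptance rates into the formula above shows how quickly the benefit grows with draft quality; the rates in this sketch are illustrative, not measured values.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model verification pass, assuming an
    i.i.d. per-token acceptance rate alpha and K drafted tokens (formula above)."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, k=4):.2f} tokens per step")
# alpha=0.6: 2.31, alpha=0.8: 3.36, alpha=0.9: 4.10
```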
21.16 Common Pitfalls and Debugging Tips
21.16.1 Training Instabilities
21.16.2 Common Implementation Bugs
21.16.3 Performance Optimization
21.17 Practical Generation with HuggingFace: Extended Examples
21.17.1 Streaming Generation
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
streamer = TextStreamer(tokenizer, skip_prompt=True)
prompt = "Explain the concept of attention in neural networks:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.generate(
**inputs,
max_new_tokens=200,
temperature=0.7,
top_p=0.9,
do_sample=True,
streamer=streamer,
)
21.17.2 Batch Generation for Throughput
prompts = [
"Summarize the theory of relativity:",
"Explain quantum entanglement:",
"Describe natural selection:",
]
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=True,
temperature=0.8,
top_p=0.95,
pad_token_id=tokenizer.eos_token_id,
)
for i, output in enumerate(outputs):
text = tokenizer.decode(output, skip_special_tokens=True)
print(f"Prompt {i+1}: {text}\n")
21.17.3 Constrained Generation with Stopping Criteria
from transformers import StoppingCriteria, StoppingCriteriaList
class StopOnNewline(StoppingCriteria):
"""Stop generation when a double newline is produced."""
def __init__(self, tokenizer):
self.stop_ids = tokenizer.encode("\n\n", add_special_tokens=False)
def __call__(self, input_ids, scores, **kwargs):
if len(self.stop_ids) > 0:
for seq in input_ids:
if seq[-len(self.stop_ids):].tolist() == self.stop_ids:
return True
return False
stopping_criteria = StoppingCriteriaList([StopOnNewline(tokenizer)])
output = model.generate(
**inputs,
max_new_tokens=500,
stopping_criteria=stopping_criteria,
)
21.18 From Language Modeling to Instruction Following
21.19 Chapter Summary