
Chapter 21: Decoder-Only Models and Autoregressive Language Models

21.1 Introduction: The Rise of Autoregressive Language Models

In 2018, a relatively simple idea changed the trajectory of artificial intelligence: take the decoder half of the Transformer architecture, train it to predict the next token on a massive corpus of text, and scale it up. That idea, crystallized in OpenAI's GPT (Generative Pre-trained Transformer), launched a revolution that would eventually produce ChatGPT, Claude, and the broader large language model (LLM) ecosystem we know today.

This chapter is dedicated to understanding decoder-only Transformer models and the autoregressive generation paradigm that powers them. While Chapter 19 introduced the full encoder-decoder Transformer and Chapter 20 explored encoder-only models like BERT, here we focus exclusively on the left-to-right, next-token-prediction framework. We will build intuition for why this seemingly simple objective produces such capable models, walk through the architectural details that distinguish decoder-only models from their encoder counterparts, and implement a working mini-GPT from scratch in PyTorch.

By the end of this chapter, you will be able to:

  • Explain the autoregressive factorization of language and why it is a natural fit for text generation.
  • Describe the causal (left-to-right) masking mechanism and how it differs from bidirectional attention.
  • Trace the architectural evolution from GPT-1 through GPT-3 and understand the key design decisions at each stage.
  • Implement a complete mini-GPT model in PyTorch, including causal self-attention, positional embeddings, and the language modeling head.
  • Apply multiple text generation strategies---greedy decoding, top-k sampling, top-p (nucleus) sampling, and temperature scaling---and understand their trade-offs.
  • Explain KV caching and why it is essential for efficient autoregressive generation.
  • Use HuggingFace's transformers library to load GPT-2 and generate text with fine-grained control over decoding parameters.

Let us begin with the mathematical foundation that underpins every autoregressive language model.


21.2 The Autoregressive Factorization of Language

21.2.1 From Joint to Conditional Probabilities

A language model assigns a probability to a sequence of tokens $\mathbf{x} = (x_1, x_2, \ldots, x_T)$. The most general way to write this is as a joint probability:

$$P(\mathbf{x}) = P(x_1, x_2, \ldots, x_T)$$

By the chain rule of probability, we can decompose this joint distribution into a product of conditional distributions:

$$P(\mathbf{x}) = \prod_{t=1}^{T} P(x_t \mid x_1, x_2, \ldots, x_{t-1}) = \prod_{t=1}^{T} P(x_t \mid \mathbf{x}_{<t})$$

This decomposition is exact---no approximations are involved. It simply states that the probability of a sequence equals the probability of the first token, times the probability of the second token given the first, times the probability of the third token given the first two, and so on. This left-to-right factorization is called the autoregressive factorization, and a model that parameterizes each conditional $P(x_t \mid \mathbf{x}_{<t})$ is called an autoregressive language model.

21.2.2 Why Autoregressive?

The term "autoregressive" comes from time series analysis, where it refers to a model that predicts the current value from previous values. In the language modeling context, autoregressive means that each token is predicted based only on the tokens that came before it---never on future tokens.

This left-to-right ordering has several appealing properties:

  1. Natural generation: To generate text, we simply sample one token at a time from left to right, conditioning on all previously generated tokens. There is no need for iterative refinement or complex decoding schemes.

  2. Exact likelihood computation: Given a sequence, we can compute its exact log-likelihood as the sum of log-probabilities of each token given its prefix.

  3. No independence assumptions: Unlike bag-of-words models or n-gram models with fixed context windows, an autoregressive Transformer can (in principle) condition on the entire preceding context.

  4. Universal density estimation: The chain rule decomposition is exact, so an autoregressive model with sufficient capacity can represent any distribution over sequences.

21.2.3 The Next-Token Prediction Objective

Training an autoregressive language model reduces to a supervised learning problem. Given a training corpus tokenized into a sequence $(x_1, x_2, \ldots, x_N)$, the training objective is to maximize the log-likelihood:

$$\mathcal{L}(\theta) = \sum_{t=1}^{N} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})$$

Equivalently, we minimize the cross-entropy loss between the model's predicted distribution and the one-hot target distribution at each position:

$$\mathcal{L}_{\text{CE}}(\theta) = -\frac{1}{N} \sum_{t=1}^{N} \log P_\theta(x_t \mid \mathbf{x}_{<t})$$

In practice, the training corpus is divided into fixed-length chunks (the context window or block size), and the model processes each chunk in parallel. Within each chunk of length $T$, the model predicts $T$ tokens simultaneously---position 1 predicts token 2, position 2 predicts token 3, and so on. This is made possible by causal masking, which we discuss next.
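
To make the chunked setup concrete, the following minimal sketch (using a toy list of token ids) shows how a chunk of length $T$ yields $T$ input/target pairs by shifting the sequence one position to the left. This is exactly how the training batches in Section 21.10 are constructed.

import torch

tokens = torch.arange(10)      # toy "corpus" of token ids 0..9
T = 4                          # block size (context window)

x = tokens[0:T]                # inputs: the first T tokens of the chunk
y = tokens[1:T + 1]            # targets: the same chunk shifted left by one
print(x.tolist())              # [0, 1, 2, 3]
print(y.tolist())              # [1, 2, 3, 4]  -> position t predicts token t+1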


21.3 Causal (Left-to-Right) Masking

21.3.1 The Problem with Bidirectional Attention

Recall from Chapter 19 that the standard self-attention mechanism allows each position to attend to all other positions in the sequence. For an encoder model like BERT (Chapter 20), this is desirable---the representation of each token can incorporate information from both past and future context. But for an autoregressive model, bidirectional attention would be cheating: if position $t$ can attend to position $t+1$, the model could simply copy the answer rather than learn to predict it.

21.3.2 The Causal Mask

To enforce the autoregressive property, we apply a causal mask (also called a look-ahead mask) to the attention scores. Before the softmax operation, we set all attention scores $e_{ij}$ where $j > i$ to $-\infty$:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V$$

where the mask $M$ is an upper-triangular matrix:

$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

After the softmax, the $-\infty$ entries become zero, so each position $i$ attends only to positions $j \leq i$ (itself and all preceding positions). This ensures that the representation at position $t$ depends only on tokens $x_1, \ldots, x_t$, preserving the autoregressive property.

21.3.3 Visualizing the Causal Mask

For a sequence of length 5, the causal attention mask looks like:

$$M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty & -\infty \\ 0 & 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$

Token 1 can attend only to itself. Token 2 can attend to tokens 1 and 2. Token 3 can attend to tokens 1, 2, and 3. And so on. This creates a triangular attention pattern that is the hallmark of decoder-only models.

21.3.4 Implementation Detail: Creating the Mask in PyTorch

In PyTorch, we typically create the causal mask using torch.triu:

import torch

def create_causal_mask(seq_len: int) -> torch.Tensor:
    """Create a causal (look-ahead) mask for autoregressive attention.

    Args:
        seq_len: Length of the input sequence.

    Returns:
        A boolean mask of shape (seq_len, seq_len) where True
        indicates positions that should be masked (set to -inf).
    """
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    return mask

This mask is then applied inside the attention computation, typically by setting masked positions to a large negative value (e.g., -1e9 or float('-inf')) before the softmax.

21.3.5 Parallel Training Despite Sequential Generation

A key insight is that during training, causal masking allows the model to compute predictions for all positions in parallel. Even though generation is sequential, training is not. The causal mask ensures that position $t$ cannot "see" tokens after position $t$, so all $T$ predictions within a chunk can be computed in a single forward pass. This is what makes Transformer-based language models so much more efficient to train than RNN-based ones, which must process tokens sequentially.
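
We can verify this property directly with a toy single-head attention computation (no learned projections, purely illustrative), reusing create_causal_mask from Section 21.3.4: perturbing a future token leaves the outputs at all earlier positions unchanged.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 6, 8
x = torch.randn(1, T, d)                      # toy "embeddings", no learned projections
mask = create_causal_mask(T)                  # from Section 21.3.4

def toy_causal_attention(h: torch.Tensor) -> torch.Tensor:
    att = (h @ h.transpose(-2, -1)) / d ** 0.5
    att = att.masked_fill(mask, float("-inf"))
    return F.softmax(att, dim=-1) @ h

out = toy_causal_attention(x)

x_perturbed = x.clone()
x_perturbed[:, -1] += 1.0                     # change only the final token
out_perturbed = toy_causal_attention(x_perturbed)

# All positions before the last are unaffected by the change
print(torch.allclose(out[:, :-1], out_perturbed[:, :-1]))  # True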


21.4 The GPT Architecture Family

21.4.1 GPT-1: Generative Pre-training (2018)

OpenAI's GPT-1 (Radford et al., 2018) was the first major demonstration that a Transformer decoder trained with a language modeling objective could produce powerful general-purpose representations. The key ideas:

  • Architecture: A 12-layer Transformer decoder with 768-dimensional hidden states, 12 attention heads, and a context length of 512 tokens. Total parameter count: 117 million.
  • Pre-training: Trained on the BooksCorpus dataset (~7,000 unpublished books, ~800 million words) using the standard next-token prediction objective.
  • Fine-tuning: After pre-training, the model was fine-tuned on downstream tasks (classification, entailment, similarity, question answering) by adding task-specific heads.
  • Tokenization: Used byte-pair encoding (BPE) with a vocabulary of ~40,000 tokens.
  • Innovation: The main contribution was showing that unsupervised pre-training on a large text corpus, followed by supervised fine-tuning, outperformed task-specific architectures on a range of benchmarks.

The GPT-1 architecture is essentially the Transformer decoder from Vaswani et al. (2017), but without cross-attention layers (since there is no encoder to attend to). Each layer consists of:

  1. Masked multi-head self-attention
  2. Layer normalization
  3. Position-wise feed-forward network
  4. Layer normalization
  5. Residual connections around both sub-layers

21.4.2 GPT-2: Language Models Are Unsupervised Multitask Learners (2019)

GPT-2 (Radford et al., 2019) scaled up GPT-1 and introduced the idea that a sufficiently large language model could perform multiple tasks without any fine-tuning---a concept they called zero-shot learning.

Key differences from GPT-1:

| Feature | GPT-1 | GPT-2 |
|---|---|---|
| Parameters | 117M | 1.5B (largest) |
| Layers | 12 | 48 |
| Hidden dim | 768 | 1600 |
| Context length | 512 | 1024 |
| Training data | BooksCorpus | WebText (40 GB) |
| Vocabulary | ~40K BPE | ~50K BPE |

Architectural refinements in GPT-2:

  • Pre-norm architecture: Layer normalization was moved to the input of each sub-layer (before the attention or FFN), rather than after. This improves training stability, especially at larger scales.
  • Additional layer norm: A final layer normalization was added after the last Transformer block.
  • Weight initialization: Residual layers were scaled by $1/\sqrt{N}$ where $N$ is the number of residual layers, to prevent the residual stream from growing too large at initialization.

GPT-2 demonstrated remarkable zero-shot capabilities on tasks like reading comprehension, summarization, and translation, despite never being explicitly trained on these tasks. This was evidence that the next-token prediction objective, at sufficient scale, implicitly learns a wide range of language understanding capabilities.

21.4.3 GPT-3: Language Models Are Few-Shot Learners (2020)

GPT-3 (Brown et al., 2020) was a landmark in scale, with 175 billion parameters trained on a diverse corpus of ~570 GB of filtered text. Its primary contribution was demonstrating in-context learning: the ability to perform new tasks simply by providing a few examples in the prompt, without any gradient updates.

Key features of GPT-3:

  • Model sizes: The paper studied eight model sizes, from 125 million to 175 billion parameters, enabling systematic analysis of scaling behavior.
  • Alternating dense and sparse attention: Some layers used locally banded sparse attention patterns to reduce computation.
  • In-context learning paradigms:
      • Zero-shot: Task described in natural language, no examples.
      • One-shot: One example provided.
      • Few-shot: A handful of examples provided.
  • Training data: A blend of Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia.

GPT-3 showed that performance on downstream tasks scaled smoothly with model size, and that few-shot performance approached or exceeded fine-tuned smaller models on many benchmarks. This established the paradigm of prompting as an alternative to fine-tuning.

21.4.4 GPT-4 and the Frontier (2023)

While OpenAI released fewer technical details about GPT-4, several key themes emerged from the technical report (OpenAI, 2023):

  • Multimodal input: GPT-4 accepts both text and image inputs, though the decoder-only autoregressive architecture remains at its core. Image inputs are processed through a vision encoder whose outputs are projected into the same embedding space as text tokens.
  • Mixture of Experts (MoE): Widely reported (though not officially confirmed) to use a mixture-of-experts architecture, where only a subset of parameters is active for any given token. This allows the total parameter count to be very large while keeping per-token computation manageable.
  • Extended context: GPT-4 supports context lengths up to 128K tokens in its extended variant, a dramatic increase from GPT-3's 2048 tokens.
  • Predictable scaling: The technical report emphasizes that final model performance was accurately predicted from smaller-scale training runs using scaling laws, enabling confident resource allocation before committing to full-scale training.

The evolution from GPT-1 to GPT-4 illustrates a remarkable pattern: the core autoregressive decoder architecture changed surprisingly little, while scale, data quality, training infrastructure, and post-training alignment techniques drove the dramatic improvements in capability.

21.4.5 The Chinchilla Scaling Insight

An important inflection point in the GPT lineage came not from OpenAI but from DeepMind. Hoffmann et al. (2022) published the Chinchilla paper, which demonstrated that most large language models were significantly undertrained relative to their size. The key finding was a scaling law relating optimal model size $N$ and training tokens $D$ for a fixed compute budget $C$:

$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

This means that model size and training data should be scaled in roughly equal proportion. Prior to Chinchilla, the convention (following the Kaplan et al. scaling laws) favored larger models trained on relatively fewer tokens. The Chinchilla analysis showed that a 70B parameter model trained on 1.4 trillion tokens could outperform the 280B parameter Gopher model trained on 300 billion tokens, despite using the same compute budget.

The practical implication was immediate: the community shifted toward training smaller models on much more data. LLaMA (65B parameters, 1.4T tokens), Mistral 7B (trained on an undisclosed but large token count), and subsequent models all reflected this insight. For AI engineers, the Chinchilla result provides a concrete guideline: if you are planning a training run, allocate your compute budget by scaling data and parameters together, rather than investing disproportionately in either.
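
As a back-of-the-envelope illustration, the following sketch applies the standard approximation $C \approx 6ND$ together with the roughly 20-tokens-per-parameter ratio implied by the Chinchilla results (the function name and the exact FLOP figure are illustrative assumptions, not values from this chapter):

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Rough compute-optimal (params, tokens) split for a given FLOP budget."""
    # C ~ 6 * N * D and D ~ tokens_per_param * N  =>  N ~ sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Roughly Chinchilla's own budget recovers ~70B parameters and ~1.4T tokens
params, tokens = chinchilla_optimal(5.8e23)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")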

21.4.6 Summary of Architectural Choices

The core architectural pattern shared by GPT-1, GPT-2, and GPT-3 is:

  1. Token embedding: Map each token to a dense vector.
  2. Positional embedding: Add learned positional embeddings (not sinusoidal).
  3. Transformer decoder blocks: Stack $L$ layers of masked self-attention + FFN.
  4. Language modeling head: Project the final hidden state back to vocabulary size and apply softmax.

The key lever that changed across generations was scale---more parameters, more data, and more compute. The architecture itself changed remarkably little. As we will explore in Chapter 22, the scaling laws that govern this relationship are among the most important empirical findings in modern AI.


21.5 Decoder-Only Architecture in Detail

Let us now walk through each component of a decoder-only Transformer in precise detail, preparing for our PyTorch implementation.

21.5.1 Token and Positional Embeddings

The input to a decoder-only model is a sequence of token indices $(x_1, x_2, \ldots, x_T)$. Two embedding matrices are used:

  • Token embedding $W_e \in \mathbb{R}^{V \times d}$: Maps each token index to a $d$-dimensional vector.
  • Positional embedding $W_p \in \mathbb{R}^{T_{\max} \times d}$: Provides a unique embedding for each position up to the maximum context length $T_{\max}$.

The input representation is the sum:

$$h_0^{(t)} = W_e[x_t] + W_p[t]$$

Note that GPT models use learned positional embeddings, not the sinusoidal embeddings from the original Transformer paper. Learned embeddings give the model more flexibility to discover positional patterns from data.

21.5.2 Masked Multi-Head Self-Attention

Each attention head computes queries, keys, and values:

$$Q = h W^Q, \quad K = h W^K, \quad V = h W^V$$

where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$ and $d_k = d / n_{\text{heads}}$.

The causal attention scores are:

$$A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$$

where $M$ is the causal mask. The output is $AV$, and the outputs from all heads are concatenated and projected:

$$\text{MultiHead}(h) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) W^O$$

21.5.3 Position-Wise Feed-Forward Network

The feed-forward network applies two linear transformations with a non-linearity:

$$\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d}$. The inner dimension $d_{ff}$ is typically $4d$. GPT models use the GELU (Gaussian Error Linear Unit) activation rather than ReLU:

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. Intuitively, GELU acts like a "smooth gate": for large positive inputs it behaves like the identity function, for large negative inputs it outputs approximately zero, and for inputs near zero it provides a smooth, stochastic-like transition. Compared to ReLU, GELU avoids the "dead neuron" problem (where neurons with negative pre-activation never recover) and provides non-zero gradients for slightly negative inputs. The choice of GELU in GPT models has become standard across most modern Transformer architectures.

Modern variants like LLaMA use the SwiGLU activation, which replaces the single linear-then-activation pattern with a gated linear unit:

$$\text{SwiGLU}(x) = \text{SiLU}(xW_1) \otimes xV$$

where $\text{SiLU}(x) = x \cdot \sigma(x)$ is the Sigmoid Linear Unit and $\otimes$ denotes element-wise multiplication. SwiGLU provides an additional learnable gating mechanism that empirically improves language modeling performance, though it requires a third weight matrix $V$, slightly increasing parameters per layer.
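
As a concrete illustration, here is a minimal SwiGLU feed-forward block in PyTorch. It is a sketch in the spirit of the LLaMA design, not an exact reproduction: the class and attribute names are ours, and bias-free projections are assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """A minimal SwiGLU feed-forward block (gated linear unit with SiLU)."""

    def __init__(self, d_model: int, d_hidden: int) -> None:
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.v = nn.Linear(d_model, d_hidden, bias=False)   # parallel value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x W1) acts as a learned gate on the parallel projection x V
        return self.w2(F.silu(self.w1(x)) * self.v(x))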

21.5.4 Pre-Norm vs. Post-Norm

GPT-1 used post-norm (layer norm after the residual connection), following the original Transformer:

$$h' = \text{LayerNorm}(h + \text{Attention}(h))$$

GPT-2 and later models switched to pre-norm (layer norm before the sub-layer):

$$h' = h + \text{Attention}(\text{LayerNorm}(h))$$

Pre-norm is more stable during training because the residual stream is not normalized, allowing gradients to flow more freely through the residual connections.

21.5.5 The Language Modeling Head

After the final Transformer block, the hidden states are projected to logits over the vocabulary:

$$\text{logits} = h_L W_e^\top$$

Many GPT models tie the output projection weights with the input embedding matrix $W_e$, a technique called weight tying. This reduces the number of parameters and provides a regularization effect, since the model must use the same representation space for both input and output.


21.6 Implementing a Mini-GPT in PyTorch

Now let us build a complete decoder-only language model from scratch. Our mini-GPT will be small enough to train on a single GPU but will contain all the essential components of the full GPT architecture.

21.6.1 Configuration

from dataclasses import dataclass

@dataclass
class GPTConfig:
    """Configuration for our mini-GPT model.

    Attributes:
        vocab_size: Size of the token vocabulary.
        block_size: Maximum sequence length (context window).
        n_layer: Number of Transformer decoder layers.
        n_head: Number of attention heads.
        n_embd: Embedding dimension.
        dropout: Dropout probability.
    """
    vocab_size: int = 50257  # GPT-2 vocabulary size
    block_size: int = 256
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384
    dropout: float = 0.1

21.6.2 Causal Self-Attention

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention for autoregressive models.

    Implements masked multi-head attention where each position can
    only attend to itself and previous positions.

    Args:
        config: GPTConfig with model hyperparameters.
    """

    def __init__(self, config: GPTConfig) -> None:
        super().__init__()
        assert config.n_embd % config.n_head == 0

        # Key, query, value projections combined
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)

        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd

        # Causal mask: register as buffer (not a parameter)
        self.register_buffer(
            "bias",
            torch.tril(torch.ones(config.block_size, config.block_size))
            .view(1, 1, config.block_size, config.block_size)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through causal self-attention.

        Args:
            x: Input tensor of shape (batch, seq_len, n_embd).

        Returns:
            Output tensor of shape (batch, seq_len, n_embd).
        """
        B, T, C = x.size()

        # Compute Q, K, V for all heads in batch
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)

        # Reshape for multi-head attention
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        # Causal self-attention: (B, n_head, T, head_dim) x (B, n_head, head_dim, T) -> (B, n_head, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / (k.size(-1) ** 0.5))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v  # (B, n_head, T, head_dim)

        # Reassemble all head outputs
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        # Output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

21.6.3 Feed-Forward Network and Transformer Block

class FeedForward(nn.Module):
    """Position-wise feed-forward network with GELU activation.

    Args:
        config: GPTConfig with model hyperparameters.
    """

    def __init__(self, config: GPTConfig) -> None:
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through the feed-forward network.

        Args:
            x: Input tensor of shape (batch, seq_len, n_embd).

        Returns:
            Output tensor of shape (batch, seq_len, n_embd).
        """
        x = self.c_fc(x)
        x = F.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x


class TransformerBlock(nn.Module):
    """A single Transformer decoder block with pre-norm architecture.

    Uses the pre-norm variant (GPT-2 style) where layer normalization
    is applied before the attention and feed-forward sub-layers.

    Args:
        config: GPTConfig with model hyperparameters.
    """

    def __init__(self, config: GPTConfig) -> None:
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = FeedForward(config)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through the Transformer block.

        Args:
            x: Input tensor of shape (batch, seq_len, n_embd).

        Returns:
            Output tensor of shape (batch, seq_len, n_embd).
        """
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

21.6.4 The Complete Mini-GPT Model

class MiniGPT(nn.Module):
    """A minimal GPT-style decoder-only language model.

    Implements the core GPT architecture with token embeddings,
    learned positional embeddings, stacked Transformer decoder blocks,
    and a language modeling head with weight tying.

    Args:
        config: GPTConfig with model hyperparameters.
    """

    def __init__(self, config: GPTConfig) -> None:
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),
            wpe=nn.Embedding(config.block_size, config.n_embd),
            drop=nn.Dropout(config.dropout),
            h=nn.ModuleList([
                TransformerBlock(config) for _ in range(config.n_layer)
            ]),
            ln_f=nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # Weight tying
        self.transformer.wte.weight = self.lm_head.weight

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module: nn.Module) -> None:
        """Initialize weights using normal distribution.

        Args:
            module: The module whose weights to initialize.
        """
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(
        self,
        idx: torch.Tensor,
        targets: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor | None]:
        """Forward pass through the mini-GPT model.

        Args:
            idx: Input token indices of shape (batch, seq_len).
            targets: Target token indices of shape (batch, seq_len).
                If provided, the cross-entropy loss is also returned.

        Returns:
            A tuple of (logits, loss) where logits has shape
            (batch, seq_len, vocab_size) and loss is a scalar
            tensor (or None if targets is not provided).
        """
        B, T = idx.size()
        assert T <= self.config.block_size, (
            f"Sequence length {T} exceeds block size {self.config.block_size}"
        )

        # Token + positional embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        tok_emb = self.transformer.wte(idx)       # (B, T, n_embd)
        pos_emb = self.transformer.wpe(pos)        # (T, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)

        # Transformer blocks
        for block in self.transformer.h:
            x = block(x)

        # Final layer norm + language modeling head
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        # Compute loss if targets provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1),
            )

        return logits, loss

    def count_parameters(self) -> int:
        """Count the total number of trainable parameters.

        Returns:
            Total number of trainable parameters.
        """
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

21.6.5 Model Size Calculation

Let us verify the parameter count. For our mini-GPT with $d = 384$, $L = 6$, $H = 6$, $V = 50257$:

  • Token embeddings: $V \times d = 50257 \times 384 \approx 19.3\text{M}$
  • Positional embeddings: $T_{\max} \times d = 256 \times 384 \approx 98\text{K}$
  • Per layer:
  • QKV projection: $d \times 3d = 384 \times 1152 \approx 442\text{K}$
  • Output projection: $d \times d = 384 \times 384 \approx 147\text{K}$
  • FFN up: $d \times 4d = 384 \times 1536 \approx 590\text{K}$
  • FFN down: $4d \times d = 1536 \times 384 \approx 590\text{K}$
  • Layer norms: $2 \times 2d = 1536$
  • Total per layer: $\approx 1.77\text{M}$
  • All layers: $6 \times 1.77\text{M} \approx 10.6\text{M}$
  • Final layer norm: $2 \times 384 = 768$
  • LM head: weight-tied with token embeddings, so 0 additional parameters.

Total: approximately 30 million parameters---a manageable size for experimentation on a single GPU.
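
We can confirm this count directly with the classes defined above:

config = GPTConfig()   # vocab_size=50257, block_size=256, n_layer=6, n_head=6, n_embd=384
model = MiniGPT(config)
print(f"{model.count_parameters() / 1e6:.1f}M parameters")  # ~30.0M (tied weights are counted once)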


21.7 Text Generation Strategies

Once a language model is trained, we need a procedure for generating text. Given a prompt $(x_1, \ldots, x_k)$, we compute the distribution $P(x_{k+1} \mid x_1, \ldots, x_k)$ and select the next token. We then extend the prompt by one token and repeat. The question is: how should we select each token from the predicted distribution?

21.7.1 Greedy Decoding

The simplest strategy is to always pick the most likely token:

$$x_{t+1} = \arg\max_{v} P(v \mid x_1, \ldots, x_t)$$

Advantages: Deterministic, fast, often produces grammatically correct text.

Disadvantages: Tends to produce repetitive, boring text. It always follows the single highest-probability path, which often leads to degenerate loops like "I think that I think that I think that...".

21.7.2 Temperature Scaling

Before sampling, we can divide the logits by a temperature parameter $\tau > 0$:

$$P_\tau(v \mid \mathbf{x}_{<t}) = \frac{\exp(z_v / \tau)}{\sum_{v'} \exp(z_{v'} / \tau)}$$

where $z_v$ are the raw logits. The temperature controls the "sharpness" of the distribution:

  • $\tau \to 0$: The distribution becomes peaked (approaches greedy decoding).
  • $\tau = 1$: The original distribution is used.
  • $\tau > 1$: The distribution becomes flatter (more random/creative).

Temperature does not change the ranking of tokens---the most likely token remains the most likely---but it changes the relative probabilities.
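
A quick numerical illustration with a toy three-token vocabulary:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])   # raw logits for a toy 3-token vocabulary
for tau in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / tau, dim=-1)
    print(tau, [round(p, 3) for p in probs.tolist()])
# tau=0.5 sharpens the distribution toward the top token; tau=2.0 flattens it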

21.7.3 Top-k Sampling

Top-k sampling (Fan et al., 2018) restricts sampling to the $k$ most likely tokens:

  1. Compute the full probability distribution.
  2. Zero out all probabilities outside the top-$k$ tokens.
  3. Renormalize the remaining probabilities.
  4. Sample from the truncated distribution.

This prevents the model from sampling very unlikely tokens that could derail generation, while still maintaining diversity among the most likely candidates.

Trade-off: Choosing $k$ is a fixed threshold that does not adapt to the distribution. When the model is confident (distribution is peaked), even $k = 10$ may include very unlikely tokens. When the model is uncertain (distribution is flat), $k = 10$ may exclude plausible continuations.

21.7.4 Top-p (Nucleus) Sampling

Top-p sampling, also called nucleus sampling (Holtzman et al., 2020), addresses the fixed-$k$ problem by dynamically choosing the set of tokens to sample from. Instead of a fixed number of tokens, it selects the smallest set of tokens whose cumulative probability exceeds a threshold $p$:

  1. Sort tokens by probability in descending order.
  2. Compute the cumulative sum of probabilities.
  3. Include all tokens up to and including the one where the cumulative sum first reaches or exceeds $p$.
  4. Renormalize and sample.

For example, with $p = 0.9$, if the five most likely tokens have probabilities $(0.5, 0.3, 0.15, 0.03, 0.02)$, the nucleus is the first three, $\{1, 2, 3\}$ (cumulative: $0.5, 0.8, 0.95 > 0.9$).

Advantages: Adapts to the shape of the distribution. When the model is confident, the nucleus is small. When the model is uncertain, the nucleus is large.

21.7.5 A Worked Example: Comparing Sampling Strategies

To build intuition, consider a model that outputs the following probability distribution for the next token after the prompt "The cat sat on the":

| Token | Probability |
|---|---|
| mat | 0.35 |
| floor | 0.25 |
| bed | 0.15 |
| roof | 0.10 |
| chair | 0.05 |
| table | 0.04 |
| ground | 0.03 |
| other | 0.03 |

  • Greedy decoding always selects "mat" (probability 0.35).
  • Top-k with $k = 3$ samples from {mat, floor, bed} with renormalized probabilities {0.467, 0.333, 0.200}.
  • Top-p with $p = 0.9$ includes tokens until the cumulative probability reaches 0.9: {mat, floor, bed, roof, chair} (cumulative: 0.35, 0.60, 0.75, 0.85, 0.90). The nucleus contains 5 tokens.
  • Temperature $\tau = 0.5$ sharpens the distribution: "mat" gets probability ~0.54, "floor" ~0.27, making the top token much more dominant.
  • Temperature $\tau = 2.0$ flattens the distribution: "mat" gets probability ~0.19, and even rare tokens like "ground" get ~0.10, leading to more surprising but potentially incoherent outputs.

This example illustrates why practitioners typically combine temperature with nucleus sampling: temperature controls the overall randomness, while nucleus sampling provides a safety net against extremely unlikely tokens.
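
The following short script reproduces the top-k and top-p numbers from the toy distribution above (treating "other" as a single aggregate token, which is a simplification):

import torch

probs = torch.tensor([0.35, 0.25, 0.15, 0.10, 0.05, 0.04, 0.03, 0.03])
tokens = ["mat", "floor", "bed", "roof", "chair", "table", "ground", "other"]

# Top-k (k = 3): keep the three most likely tokens and renormalize
topk = probs[:3] / probs[:3].sum()
print(dict(zip(tokens[:3], [round(p, 3) for p in topk.tolist()])))  # {mat: 0.467, floor: 0.333, bed: 0.2}

# Top-p (p = 0.9): smallest prefix whose cumulative probability reaches 0.9
cumulative = torch.cumsum(probs, dim=0)
nucleus_size = int((cumulative < 0.9).sum()) + 1
print(tokens[:nucleus_size])  # ['mat', 'floor', 'bed', 'roof', 'chair']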

21.7.6 Combining Strategies

In practice, these strategies are often combined:

# Common combination: temperature + top-p
output = model.generate(
    input_ids,
    temperature=0.8,  # Slightly sharpen the distribution
    top_p=0.95,        # Nucleus sampling
    top_k=50,          # Additional safety net
    max_new_tokens=100,
)

21.7.7 Implementation of Generation Strategies

@torch.no_grad()
def generate(
    model: MiniGPT,
    idx: torch.Tensor,
    max_new_tokens: int,
    temperature: float = 1.0,
    top_k: int | None = None,
    top_p: float | None = None,
) -> torch.Tensor:
    """Generate text autoregressively from the model.

    Args:
        model: A trained MiniGPT model.
        idx: Initial token indices of shape (batch, seq_len).
        max_new_tokens: Number of new tokens to generate.
        temperature: Temperature for scaling logits.
        top_k: If set, restrict sampling to top-k tokens.
        top_p: If set, use nucleus sampling with this threshold.

    Returns:
        Token indices of shape (batch, seq_len + max_new_tokens).
    """
    model.eval()
    for _ in range(max_new_tokens):
        # Crop to block size if necessary
        idx_cond = idx[:, -model.config.block_size:]

        # Forward pass
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # (B, vocab_size)

        # Top-k filtering
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')

        # Top-p (nucleus) filtering
        if top_p is not None:
            sorted_logits, sorted_indices = torch.sort(
                logits, descending=True
            )
            cumulative_probs = torch.cumsum(
                F.softmax(sorted_logits, dim=-1), dim=-1
            )
            # Remove tokens with cumulative probability above threshold
            sorted_indices_to_remove = cumulative_probs > top_p
            # Shift so the first token above threshold is kept
            sorted_indices_to_remove[..., 1:] = (
                sorted_indices_to_remove[..., :-1].clone()
            )
            sorted_indices_to_remove[..., 0] = 0
            indices_to_remove = sorted_indices_to_remove.scatter(
                1, sorted_indices, sorted_indices_to_remove
            )
            logits[indices_to_remove] = float('-inf')

        # Sample from the distribution
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)

        # Append to sequence
        idx = torch.cat((idx, idx_next), dim=1)

    return idx

21.8 KV Caching for Efficient Generation

21.8.1 The Redundancy Problem

During autoregressive generation, each new token requires a full forward pass through the model. At step $t$, the model processes the sequence $(x_1, \ldots, x_t)$ to predict $x_{t+1}$. At step $t+1$, it processes $(x_1, \ldots, x_t, x_{t+1})$ to predict $x_{t+2}$. Notice that the keys and values for positions $1$ through $t$ are recomputed at every step, even though they never change (because causal masking ensures that earlier positions do not depend on later ones).

21.8.2 The KV Cache Solution

KV caching eliminates this redundancy by storing the key and value tensors from previous steps. At each generation step, only the new token's key and value are computed and appended to the cache. The attention computation then uses the full cached keys and values, but only computes a new query for the latest position.

Without a KV cache (generating $T$ tokens from a prompt of length $P$), step $t$ reprocesses the entire prefix, so its attention cost grows quadratically with the current sequence length:

$$\text{Total attention FLOPs} \propto \sum_{t=P}^{P+T} t^2 \cdot d = O\left(T \cdot (P + T)^2 \cdot d\right)$$

With a KV cache, step $t$ computes only the new token's query, key, and value and attends over the cached keys:

$$\text{Total attention FLOPs} \propto \sum_{t=P}^{P+T} t \cdot d = O\left(T \cdot (P + T) \cdot d\right)$$

For long sequences, KV caching provides a dramatic speedup: the per-step attention cost drops from quadratic to linear in the current sequence length.

21.8.3 KV Cache Implementation

class CausalSelfAttentionWithCache(nn.Module):
    """Causal self-attention with KV cache support.

    Args:
        config: GPTConfig with model hyperparameters.
    """

    def __init__(self, config: GPTConfig) -> None:
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head

    def forward(
        self,
        x: torch.Tensor,
        kv_cache: tuple[torch.Tensor, torch.Tensor] | None = None,
    ) -> tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
        """Forward pass with optional KV cache.

        Args:
            x: Input tensor of shape (batch, seq_len, n_embd).
                During cached generation, seq_len is typically 1.
            kv_cache: Optional tuple of (cached_keys, cached_values),
                each of shape (batch, n_head, cache_len, head_dim).

        Returns:
            A tuple of (output, new_kv_cache) where output has shape
            (batch, seq_len, n_embd) and new_kv_cache is a tuple
            of updated (keys, values) tensors.
        """
        B, T, C = x.size()

        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)

        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)

        # Append to cache if it exists
        if kv_cache is not None:
            k_cache, v_cache = kv_cache
            k = torch.cat([k_cache, k], dim=2)
            v = torch.cat([v_cache, v], dim=2)

        new_kv_cache = (k, v)

        # Compute attention
        att = (q @ k.transpose(-2, -1)) * (1.0 / (self.head_dim ** 0.5))

        # Causal masking is only needed for the uncached "prefill" of a
        # multi-token prompt. During cached decoding we assume T == 1: the
        # single new query may attend to every cached position, so no mask
        # is required.
        if kv_cache is None and T > 1:
            mask = torch.triu(
                torch.ones(T, T, device=x.device), diagonal=1
            ).bool()
            att = att.masked_fill(
                mask.unsqueeze(0).unsqueeze(0), float('-inf')
            )

        att = F.softmax(att, dim=-1)
        y = att @ v

        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.c_proj(y)

        return y, new_kv_cache
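
A minimal usage sketch (with small, illustrative hyperparameters) shows the intended pattern: one uncached prefill pass over the prompt embeddings, followed by single-token decoding steps that reuse and extend the cache.

import torch

config = GPTConfig(block_size=64, n_layer=1, n_head=4, n_embd=32, vocab_size=100)
attn = CausalSelfAttentionWithCache(config)

prompt = torch.randn(1, 10, config.n_embd)        # embeddings of a 10-token prompt
out, cache = attn(prompt, kv_cache=None)          # prefill: causal mask applied internally

for _ in range(5):                                # incremental decoding, one position at a time
    new_token = torch.randn(1, 1, config.n_embd)
    out, cache = attn(new_token, kv_cache=cache)  # only the new K/V are computed and appended

print(cache[0].shape)  # (1, n_head, 15, head_dim): 10 prompt + 5 generated positions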

21.8.4 Memory Trade-off

KV caching trades memory for speed. For a model with $L$ layers, $H$ heads, head dimension $d_k$, and a cache of length $S$:

$$\text{KV cache memory} = 2 \times L \times S \times H \times d_k \times \text{sizeof(dtype)}$$

For GPT-2 Large ($L=36$, $H=20$, $d_k=64$, $S=1024$, float16):

$$2 \times 36 \times 1024 \times 20 \times 64 \times 2 \text{ bytes} \approx 189 \text{ MB per sequence}$$

For larger models with longer context windows, the KV cache can consume gigabytes of memory, which is one of the bottlenecks in serving LLMs in production.
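
The formula is easy to wrap in a small helper for capacity planning (a sketch; the function name is ours), which reproduces the GPT-2 Large figure above:

def kv_cache_bytes(
    n_layer: int, n_head: int, head_dim: int, seq_len: int, bytes_per_elem: int = 2
) -> int:
    """Bytes used by the keys and values of one sequence across all layers."""
    return 2 * n_layer * seq_len * n_head * head_dim * bytes_per_elem

# GPT-2 Large with a 1024-token cache in float16
print(kv_cache_bytes(n_layer=36, n_head=20, head_dim=64, seq_len=1024) / 1e6)  # ~188.7 MB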

Several techniques have been developed to address this bottleneck:

  • Multi-Query Attention (MQA): Uses a single key-value head shared across all query heads, reducing the KV cache by a factor equal to the number of heads. Shazeer (2019) showed that this provides a significant speedup with minimal quality loss.
  • Grouped Query Attention (GQA): A compromise between full multi-head attention and MQA, grouping query heads into groups that share key-value heads. Llama 2 and Mistral use GQA with 8 KV heads for their larger variants.
  • Paged Attention: Used in the vLLM serving framework, paged attention manages KV cache memory like virtual memory pages, reducing waste from memory fragmentation and enabling efficient batching of requests with different sequence lengths.
  • Sliding Window KV Cache: Rather than storing keys and values for all past tokens, only store the most recent $W$ tokens. Combined with stacked layers (as in Mistral), this provides long-range context while bounding cache size.

21.9 Using GPT-2 with HuggingFace Transformers

Let us now see how to use a pre-trained GPT-2 model from the HuggingFace transformers library.

21.9.1 Loading the Model and Tokenizer

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 (124M parameters)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Tokenize a prompt
prompt = "The future of artificial intelligence"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

print(f"Prompt: {prompt}")
print(f"Token IDs: {input_ids}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(input_ids[0])}")

21.9.2 Generation with Different Strategies

import torch

torch.manual_seed(42)

# Greedy decoding
greedy_output = model.generate(
    input_ids,
    max_new_tokens=50,
    do_sample=False,
)
print("Greedy:", tokenizer.decode(greedy_output[0], skip_special_tokens=True))

# Temperature sampling
temp_output = model.generate(
    input_ids,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
)
print("Temperature 0.7:", tokenizer.decode(temp_output[0], skip_special_tokens=True))

# Top-k sampling
topk_output = model.generate(
    input_ids,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
)
print("Top-k (k=50):", tokenizer.decode(topk_output[0], skip_special_tokens=True))

# Nucleus sampling
nucleus_output = model.generate(
    input_ids,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.92,
)
print("Nucleus (p=0.92):", tokenizer.decode(nucleus_output[0], skip_special_tokens=True))

21.9.3 Analyzing Token Probabilities

import torch
import torch.nn.functional as F

torch.manual_seed(42)

with torch.no_grad():
    outputs = model(input_ids)
    logits = outputs.logits  # (1, seq_len, vocab_size)

# Get probabilities for the next token after the prompt
next_token_logits = logits[0, -1, :]
probs = F.softmax(next_token_logits, dim=-1)

# Top 10 most likely next tokens
top_probs, top_indices = torch.topk(probs, 10)
print("Top 10 next-token predictions:")
for i in range(10):
    token = tokenizer.decode(top_indices[i])
    print(f"  {token!r}: {top_probs[i]:.4f}")

21.9.4 Computing Perplexity

Perplexity is the standard evaluation metric for language models:

$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{t=1}^{N} \log P(x_t \mid \mathbf{x}_{<t})\right)$$

def compute_perplexity(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    text: str,
) -> float:
    """Compute the perplexity of a text under the model.

    Args:
        model: A GPT-2 language model.
        tokenizer: The corresponding tokenizer.
        text: The text to evaluate.

    Returns:
        The perplexity of the text.
    """
    input_ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss  # Cross-entropy loss
    return torch.exp(loss).item()

text = "The quick brown fox jumps over the lazy dog."
ppl = compute_perplexity(model, tokenizer, text)
print(f"Perplexity: {ppl:.2f}")

A lower perplexity indicates that the model assigns higher probability to the text---in other words, the text is less "surprising" to the model.


21.10 Training a Mini-GPT on Text Data

Let us put our mini-GPT implementation to work by training it on a real text dataset.

21.10.1 Data Preparation

import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """A simple character-level or token-level text dataset.

    Args:
        text: The raw text data.
        tokenizer: A tokenizer (or None for character-level).
        block_size: The context window size.
    """

    def __init__(
        self,
        text: str,
        tokenizer: GPT2Tokenizer | None,
        block_size: int,
    ) -> None:
        if tokenizer is not None:
            self.data = torch.tensor(
                tokenizer.encode(text), dtype=torch.long
            )
        else:
            # Character-level fallback
            chars = sorted(list(set(text)))
            self.stoi = {ch: i for i, ch in enumerate(chars)}
            self.itos = {i: ch for i, ch in enumerate(chars)}
            self.data = torch.tensor(
                [self.stoi[c] for c in text], dtype=torch.long
            )
        self.block_size = block_size

    def __len__(self) -> int:
        return len(self.data) - self.block_size

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        x = self.data[idx: idx + self.block_size]
        y = self.data[idx + 1: idx + self.block_size + 1]
        return x, y

21.10.2 Training Loop

def train_mini_gpt(
    model: MiniGPT,
    dataset: TextDataset,
    epochs: int = 5,
    batch_size: int = 32,
    learning_rate: float = 3e-4,
    device: str = "cuda" if torch.cuda.is_available() else "cpu",
) -> list[float]:
    """Train the mini-GPT model on a text dataset.

    Args:
        model: The MiniGPT model to train.
        dataset: The training dataset.
        epochs: Number of training epochs.
        batch_size: Batch size.
        learning_rate: Learning rate for the optimizer.
        device: Device to train on.

    Returns:
        A list of average losses per epoch.
    """
    torch.manual_seed(42)
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=True, drop_last=True
    )

    epoch_losses = []
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        num_batches = 0

        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            logits, loss = model(x, targets=y)

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_loss += loss.item()
            num_batches += 1

        avg_loss = total_loss / num_batches
        epoch_losses.append(avg_loss)
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}")

    return epoch_losses
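
Putting the pieces together, a hypothetical end-to-end run might look like the following. The corpus path and hyperparameters are illustrative; character-level tokenization keeps the vocabulary small.

import torch

with open("tiny_corpus.txt", "r", encoding="utf-8") as f:  # hypothetical text file
    raw_text = f.read()

dataset = TextDataset(raw_text, tokenizer=None, block_size=128)  # character-level
config = GPTConfig(
    vocab_size=len(dataset.stoi),  # character vocabulary
    block_size=128,
    n_layer=4,
    n_head=4,
    n_embd=256,
)
model = MiniGPT(config)
print(f"Parameters: {model.count_parameters() / 1e6:.1f}M")

losses = train_mini_gpt(model, dataset, epochs=3, batch_size=32)

# Sample a short continuation from a single-character prompt
device = next(model.parameters()).device
start = torch.tensor([[dataset.stoi[raw_text[0]]]], dtype=torch.long, device=device)
out = generate(model, start, max_new_tokens=200, temperature=0.8, top_k=50)
print("".join(dataset.itos[int(i)] for i in out[0]))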

21.11 Repetition Penalties and Advanced Decoding

21.11.1 The Repetition Problem

Autoregressive models have a well-known tendency to produce repetitive text, especially with greedy or low-temperature decoding. This happens because once a phrase appears in the context, the model assigns higher probability to it appearing again (since repeated phrases are common in training data).

21.11.2 Repetition Penalty

The repetition penalty (Keskar et al., 2019) discourages the model from producing tokens that have already appeared in the generated text:

$$\tilde{z}_v = \begin{cases} z_v / \alpha & \text{if } z_v > 0 \text{ and } v \in \text{generated tokens} \\ z_v \times \alpha & \text{if } z_v < 0 \text{ and } v \in \text{generated tokens} \\ z_v & \text{otherwise} \end{cases}$$

where $\alpha > 1$ is the penalty factor. This reduces the probability of tokens that have already been generated, without completely prohibiting them.
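
A minimal sketch of this penalty applied to next-token logits (the function name and default penalty value are illustrative):

import torch

def apply_repetition_penalty(
    logits: torch.Tensor,      # (vocab_size,) next-token logits
    generated: torch.Tensor,   # 1-D tensor of token ids generated so far
    penalty: float = 1.2,
) -> torch.Tensor:
    """Down-weight tokens that already appear in the generated text."""
    logits = logits.clone()
    seen = torch.unique(generated)
    scores = logits[seen]
    # Divide positive logits (or multiply negative ones) by the penalty,
    # lowering the probability of repeats without forbidding them outright.
    logits[seen] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits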

21.11.3 Frequency and Presence Penalties

OpenAI's API popularized two related penalties:

  • Frequency penalty: Reduces the logit of a token proportionally to how many times it has appeared. Formula: $z_v' = z_v - \alpha_f \cdot \text{count}(v)$.
  • Presence penalty: Reduces the logit of a token if it has appeared at all. Formula: $z_v' = z_v - \alpha_p \cdot \mathbb{1}[v \in \text{generated}]$.

21.11.4 Beam Search

While less common for open-ended generation, beam search is still used for tasks like machine translation. It maintains $B$ candidate sequences (beams) and at each step expands each beam with its top-$B$ continuations, keeping only the best $B$ sequences overall:

def beam_search(
    model: MiniGPT,
    idx: torch.Tensor,
    max_new_tokens: int,
    beam_width: int = 5,
) -> torch.Tensor:
    """Generate text using beam search.

    Args:
        model: A trained MiniGPT model.
        idx: Initial token indices of shape (1, seq_len).
        max_new_tokens: Number of new tokens to generate.
        beam_width: Number of beams to maintain.

    Returns:
        The best sequence of token indices.
    """
    model.eval()
    device = idx.device

    # Initialize beams: (sequence, cumulative_log_prob)
    beams = [(idx, 0.0)]

    for _ in range(max_new_tokens):
        all_candidates = []
        for seq, score in beams:
            seq_cond = seq[:, -model.config.block_size:]
            logits, _ = model(seq_cond)
            log_probs = F.log_softmax(logits[:, -1, :], dim=-1)

            # Get top-k continuations for this beam
            top_log_probs, top_indices = torch.topk(
                log_probs, beam_width, dim=-1
            )

            for i in range(beam_width):
                new_seq = torch.cat(
                    [seq, top_indices[:, i:i + 1]], dim=1
                )
                new_score = score + top_log_probs[0, i].item()
                all_candidates.append((new_seq, new_score))

        # Keep the top beam_width candidates
        beams = sorted(
            all_candidates, key=lambda x: x[1], reverse=True
        )[:beam_width]

    # Return the best beam
    return beams[0][0]

21.12 Understanding Attention Patterns in Decoder Models

21.12.1 Attention Head Specialization

Research has shown that different attention heads in GPT-style models learn to specialize in different roles:

  • Positional heads: Attend to specific relative positions (e.g., always attend to the previous token, or the token two positions back).
  • Syntactic heads: Track syntactic relationships like subject-verb agreement.
  • Induction heads: Implement a form of in-context copying by attending to tokens that follow a pattern similar to the current context. These heads are believed to be crucial for in-context learning.
  • Rare-word heads: Attend strongly to rare or informative tokens in the context.

21.12.2 Visualizing Attention

def visualize_attention(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    text: str,
    layer: int = 0,
    head: int = 0,
) -> None:
    """Extract and display attention weights from a GPT-2 model.

    Args:
        model: A GPT-2 model with output_attentions enabled.
        tokenizer: The GPT-2 tokenizer.
        text: Input text to analyze.
        layer: Which layer's attention to visualize.
        head: Which attention head to visualize.
    """
    import matplotlib.pyplot as plt

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # Extract attention weights: (n_layers, batch, n_heads, seq_len, seq_len)
    attention = outputs.attentions[layer][0, head].numpy()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    fig, ax = plt.subplots(figsize=(10, 8))
    im = ax.imshow(attention, cmap="viridis")
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticklabels(tokens)
    ax.set_xlabel("Key")
    ax.set_ylabel("Query")
    ax.set_title(f"Layer {layer}, Head {head}")
    plt.colorbar(im)
    plt.tight_layout()
    plt.savefig("attention_pattern.png", dpi=150)
    plt.show()

The triangular structure of the attention matrix is clearly visible: each row (query) only has non-zero attention weights for columns (keys) up to and including its own position.


21.13 Positional Encoding Variants

21.13.1 Learned Absolute Positions (GPT-1/2/3)

The original GPT models use learned position embeddings: a separate embedding vector for each absolute position. This is simple and effective but limits the model to the maximum sequence length seen during training.

21.13.2 Rotary Positional Embeddings (RoPE)

Modern decoder-only models (e.g., Llama, Mistral) often use Rotary Positional Embeddings (Su et al., 2021). RoPE encodes position information by rotating the query and key vectors:

$$\text{RoPE}(x_m, m) = R_m x_m$$

where $R_m$ is a rotation matrix that depends on position $m$. The key property is that the dot product $\langle R_m q, R_n k \rangle$ depends only on the relative position $m - n$, giving the model a natural sense of distance that generalizes to longer sequences.
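
The following is a minimal sketch of the "rotate-half" RoPE variant used by several open implementations; the function name and tensor layout are illustrative. It is applied to the queries and keys before the attention dot product.

import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings (rotate-half variant) to queries or keys.

    Args:
        x: Tensor of shape (batch, n_head, seq_len, head_dim), head_dim even.
    """
    B, H, T, D = x.shape
    half = D // 2
    # One frequency per rotated pair of dimensions
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = torch.arange(T, device=x.device, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()       # each (T, half), broadcast over batch and heads
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)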

21.13.3 ALiBi (Attention with Linear Biases)

ALiBi (Press et al., 2022) takes a different approach: instead of adding positional information to the embeddings, it adds a linear bias to the attention scores that penalizes distant positions:

$$\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} - m \cdot |i - j|\right)$$

where $m$ is a head-specific slope. This approach allows models to extrapolate to longer sequences than they were trained on.
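
A minimal sketch of the per-head bias matrix follows; the slopes use the geometric schedule recommended in the paper for head counts that are powers of two, and the function name is ours.

import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear biases of shape (n_heads, seq_len, seq_len), added to attention scores."""
    # Head-specific slopes: 2^{-8/n}, 2^{-16/n}, ... (a geometric sequence)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).float()  # (j - i): 0 on the diagonal, negative for past keys
    # Bias = -slope * (i - j); future keys (j > i) are removed by the causal mask anyway
    return slopes[:, None, None] * rel[None, :, :]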


21.14 Decoder-Only Models Beyond GPT

21.14.1 Llama and the Open-Source Revolution

Meta's Llama models (Touvron et al., 2023) are decoder-only Transformers that incorporate several modern improvements:

  • RoPE instead of learned positional embeddings.
  • SwiGLU activation instead of GELU: $\text{SwiGLU}(x) = \text{Swish}(xW_1) \otimes xV$, where $\otimes$ is element-wise multiplication.
  • RMSNorm instead of LayerNorm, which normalizes by the root mean square without centering.
  • Grouped Query Attention (GQA): Shares key-value heads across multiple query heads, reducing the KV cache size.

21.14.2 Mistral and the Efficient Frontier

Mistral AI's models (Jiang et al., 2023) introduced several architectural innovations aimed at improving efficiency without sacrificing quality:

  • Sliding Window Attention (SWA): Instead of attending to all previous tokens, each layer attends only to a fixed window of the most recent $W$ tokens. With $L$ layers, the effective receptive field is $L \times W$ tokens, providing long-range context through stacked layers without the quadratic cost of full attention.
  • Mixture of Experts (MoE): Mixtral 8x7B uses a sparse mixture-of-experts architecture where each token is routed to 2 of 8 expert FFN modules per layer. The total parameter count is 46.7B, but only ~12.9B parameters are active per token, achieving the quality of a much larger dense model at the inference cost of a smaller one.
  • Grouped Query Attention: Like Llama 2, Mistral uses GQA to reduce KV cache memory, which is essential for efficient serving.

Mistral 7B outperformed the much larger Llama 2 13B on most benchmarks, demonstrating that architectural innovation and data quality can compensate for raw parameter count.
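
As a concrete illustration of sliding window attention, a windowed variant of the causal mask from Section 21.3.4 is a one-line change (a sketch; the function name is ours):

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask (True = blocked): causal, and limited to the last `window` keys."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    # Block future keys (j > i) and keys older than the window (j <= i - window)
    return (j > i) | (j <= i - window)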

21.14.3 Falcon and Data-Centric Scaling

The Technology Innovation Institute's Falcon models (Penedo et al., 2023) demonstrated the importance of data curation. Falcon was trained on the RefinedWeb dataset, a carefully filtered and deduplicated subset of Common Crawl. The key insight was that web data, when properly curated, can match or exceed the quality of manually assembled datasets. Falcon 40B achieved state-of-the-art results among open models at the time of release, with its success attributed primarily to data quality rather than architectural novelty.

21.14.4 Phi: The Small Model Revolution

Microsoft's Phi series (Gunasekar et al., 2023) challenged the assumption that scale is necessary for strong performance. Phi-1 (1.3B parameters) achieved remarkable coding ability by training on carefully selected "textbook-quality" data. The subsequent Phi-2 (2.7B) and Phi-3 (3.8B) models extended this philosophy to general language understanding. The Phi results suggest that data quality can substitute for model size to a surprising degree, an insight that is especially important for deployment scenarios where inference cost must be minimized.

21.14.5 The Decoder-Only Landscape at a Glance

| Model | Organization | Parameters | Key Innovation |
|---|---|---|---|
| Chinchilla | DeepMind | 70B | Compute-optimal training (as we saw in Section 21.4.5) |
| PaLM / PaLM 2 | Google | 540B / undisclosed | Pathways system, multilingual strength |
| Falcon | TII | 7B / 40B / 180B | Curated web data (RefinedWeb) |
| Mistral / Mixtral | Mistral AI | 7B / 8x7B | Sliding window attention, MoE |
| Phi-1/2/3 | Microsoft | 1.3B--14B | High-quality small-data training |
| Llama 1/2/3 | Meta | 7B--405B | RoPE, GQA, SwiGLU, open weights |
| Gemma | Google DeepMind | 2B / 7B / 27B | Efficient training, multi-query attention |
| Qwen | Alibaba | 0.5B--72B | Strong multilingual and coding ability |

The open-source ecosystem has evolved rapidly, and the choice of base model depends on the specific use case: Phi for latency-sensitive deployment, Llama 3 for general-purpose applications, Mixtral for throughput-optimized serving, and specialized models for domain-specific tasks. We will explore how to fine-tune these models in Chapter 24 and align them in Chapter 25.


21.15 Speculative Decoding

21.15.1 The Inference Bottleneck

Autoregressive generation is inherently sequential: each token must be generated before the next can begin. For large models, each forward pass is expensive, and the total generation time scales linearly with the number of output tokens. This makes generation the primary bottleneck in production LLM systems, especially for long outputs.

21.15.2 The Speculative Decoding Idea

Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) accelerates generation by using a smaller, faster draft model to propose multiple tokens at once, then verifying those tokens in parallel with the larger target model. The key insight is that verification is much cheaper than generation: the target model can check $K$ proposed tokens in a single forward pass (the same cost as generating one token), whereas generating those tokens one at a time would require $K$ forward passes.

The algorithm works as follows:

  1. The draft model $q$ generates $K$ candidate tokens autoregressively: $\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_K$.
  2. The target model $p$ processes the prefix and all $K$ draft tokens in a single forward pass, computing $p(y_t \mid y_{<t})$ at each candidate position in parallel.
  3. Each candidate token is accepted or rejected using a rejection sampling scheme that ensures the final distribution exactly matches the target model's distribution.
  4. If token $\tilde{y}_t$ is rejected, a corrected token is sampled from a modified distribution, and all subsequent draft tokens are discarded.

21.15.3 Acceptance Criterion

For each draft token $\tilde{y}_t$, the acceptance probability is:

$$\alpha_t = \min\left(1, \frac{p(\tilde{y}_t \mid y_{<t})}{q(\tilde{y}_t \mid y_{<t})}\right)$$

If $\tilde{y}_t$ is rejected (with probability $1 - \alpha_t$), a corrected token is sampled from:

$$y_t \sim \text{norm}\left(\max\left(0,\; p(\cdot \mid y_{<t}) - q(\cdot \mid y_{<t})\right)\right)$$

where $\text{norm}(\cdot)$ rescales the clipped difference so that it sums to one.

This rejection sampling scheme guarantees that the output distribution is identical to sampling from the target model alone---speculative decoding provides a speedup with no quality degradation.
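The sketch below implements one verification step in plain PyTorch, assuming the target and draft distributions at each draft position have already been computed; function and variable names are illustrative, and the "bonus" token the target model can emit after a fully accepted draft is omitted for brevity.

import torch

def verify_draft(p_probs, q_probs, draft_tokens):
    # p_probs, q_probs: (K, vocab) target / draft distributions at each draft position.
    # draft_tokens: (K,) tokens proposed by the draft model.
    accepted = []
    for t, y in enumerate(draft_tokens.tolist()):
        accept_prob = torch.clamp(p_probs[t, y] / q_probs[t, y], max=1.0)
        if torch.rand(()) < accept_prob:
            accepted.append(y)                                   # keep the draft token
        else:
            residual = torch.clamp(p_probs[t] - q_probs[t], min=0.0)
            corrected = torch.multinomial(residual / residual.sum(), 1).item()
            accepted.append(corrected)                           # resample from norm(max(0, p - q))
            break                                                # discard remaining draft tokens
    return accepted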

21.15.4 Expected Speedup

The speedup depends on how well the draft model approximates the target model. If the average acceptance rate per token is $\alpha$, the expected number of tokens accepted per verification step is:

$$\mathbb{E}[\text{accepted tokens}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

In practice, with a well-chosen draft model, speculative decoding achieves 2-3x speedup. The draft model is typically a smaller model from the same family (e.g., Llama-7B drafting for Llama-70B) or a purpose-trained lightweight model.
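As a quick worked example with assumed values: an average acceptance rate of $\alpha = 0.8$ and $K = 4$ draft tokens yields about 3.4 accepted tokens per target forward pass.

alpha, K = 0.8, 4
expected_tokens = (1 - alpha ** (K + 1)) / (1 - alpha)   # (1 - 0.8**5) / 0.2 = 3.3616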


21.16 Common Pitfalls and Debugging Tips

21.16.1 Training Instabilities

  1. Loss spikes: Sudden increases in loss during training. Common causes include learning rate too high, insufficient warmup, or numerical overflow in attention. Solutions: gradient clipping, learning rate warmup, mixed-precision training with loss scaling.

  2. Degenerate repetition during generation: If the model generates the same token or phrase repeatedly, try increasing the temperature, switching from greedy decoding to top-p sampling, or adding a repetition penalty.

  3. Context length mismatch: If you train with a block size of 256 but try to generate with a context longer than 256 tokens, the model will either crash (learned position embeddings have no entry for the extra positions) or produce garbage. Always respect the training context length by cropping the conditioning context, as in the snippet after this list.
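A minimal sketch of the usual fix inside a generation loop (variable names follow the mini-GPT style used earlier in the chapter and are otherwise illustrative):

import torch

def crop_context(idx, block_size):
    # Keep only the most recent block_size tokens as conditioning context.
    return idx if idx.size(1) <= block_size else idx[:, -block_size:]

# Inside the generation loop:
# logits = model(crop_context(idx, block_size))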

21.16.2 Common Implementation Bugs

  1. Forgetting the causal mask: Without the mask, the model can "see the future" during training and learns to cheat. The training loss will drop rapidly but the model will not generate coherent text.

  2. Off-by-one errors in targets: The target sequence should be the input sequence shifted by one position, so that targets[t] = input[t + 1] (see the snippet after this list).

  3. Not handling the mask correctly during cached generation: With a KV cache, each new query token should attend to all cached keys plus itself; reusing a mask shaped for full-sequence training will either crash or silently mask out valid context.
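A minimal sketch of the shifted-target construction on a toy token stream (the vocabulary size and data are illustrative):

import torch

data = torch.randint(0, 50257, (1000,))      # toy token stream
block_size = 8
i = 0
x = data[i : i + block_size]                 # inputs:  tokens i .. i+block_size-1
y = data[i + 1 : i + block_size + 1]         # targets: tokens i+1 .. i+block_size
assert torch.equal(y[:-1], x[1:])            # targets are the inputs shifted by one position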

21.16.3 Performance Optimization

  1. Flash Attention: Use torch.nn.functional.scaled_dot_product_attention with is_causal=True for optimized attention that fuses the causal mask, softmax, and matrix multiplications into a single kernel (see the snippet after this list).

  2. Mixed precision: Use torch.cuda.amp for automatic mixed-precision training, which can provide a 2-3x speedup.

  3. Gradient accumulation: If your GPU cannot fit large batch sizes, accumulate gradients over multiple mini-batches.
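A minimal sketch of the fused causal attention call (available in PyTorch 2.0 and later; tensor shapes are illustrative):

import torch
import torch.nn.functional as F

# q, k, v: (batch, n_heads, seq_len, head_dim)
q, k, v = (torch.randn(2, 8, 128, 64) for _ in range(3))

# The causal mask, softmax, and both matrix multiplications run in one fused kernel,
# dispatching to a Flash Attention implementation when hardware and dtype allow.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # (2, 8, 128, 64)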


21.17 Practical Generation with HuggingFace: Extended Examples

Beyond the basic generation examples shown in Section 21.9, production systems often require more nuanced control over the generation process. Here we present several common patterns.

21.17.1 Streaming Generation

For interactive applications, streaming delivers tokens to the user as they are generated, reducing perceived latency:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

streamer = TextStreamer(tokenizer, skip_prompt=True)

prompt = "Explain the concept of attention in neural networks:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    streamer=streamer,
)

21.17.2 Batch Generation for Throughput

When processing many prompts, batching amortizes the cost of model loading and memory transfers:

prompts = [
    "Summarize the theory of relativity:",
    "Explain quantum entanglement:",
    "Describe natural selection:",
]

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # decoder-only models should be left-padded for generation
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)

for i, output in enumerate(outputs):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Prompt {i+1}: {text}\n")

21.17.3 Constrained Generation with Stopping Criteria

Production systems often need generation to stop at specific tokens or patterns:

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnDoubleNewline(StoppingCriteria):
    """Stop generation when a double newline is produced."""
    def __init__(self, tokenizer):
        self.stop_ids = tokenizer.encode("\n\n", add_special_tokens=False)

    def __call__(self, input_ids, scores, **kwargs):
        if len(self.stop_ids) > 0:
            for seq in input_ids:
                if seq[-len(self.stop_ids):].tolist() == self.stop_ids:
                    return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnDoubleNewline(tokenizer)])

output = model.generate(
    **inputs,
    max_new_tokens=500,
    stopping_criteria=stopping_criteria,
)

These patterns form the building blocks of production LLM applications. As we will explore in Chapter 23, the prompts fed to these generation pipelines are themselves an engineering discipline.


21.18 From Language Modeling to Instruction Following

The autoregressive language model is the foundation, but the models people interact with (ChatGPT, Claude) undergo additional training stages:

  1. Pre-training: Next-token prediction on a large corpus (what we covered in this chapter).
  2. Supervised fine-tuning (SFT): Training on curated instruction-response pairs.
  3. Reinforcement learning from human feedback (RLHF) or Direct Preference Optimization (DPO): Aligning the model with human preferences.

We will explore these alignment techniques in Chapter 25. The key insight is that the base autoregressive model provides the "knowledge" and "capability," while alignment training shapes the model's behavior to be helpful, harmless, and honest.


21.19 Chapter Summary

In this chapter, we have explored the decoder-only Transformer architecture and the autoregressive language modeling paradigm:

  1. Autoregressive factorization decomposes a sequence probability into a product of conditional probabilities, each predicting the next token given all previous tokens.

  2. Causal masking enforces the autoregressive property during training by preventing each position from attending to future positions, while still allowing parallel computation of all predictions within a sequence.

  3. The GPT family (GPT-1, GPT-2, GPT-3) demonstrated that scaling a simple decoder-only architecture with a next-token prediction objective produces increasingly capable language models, culminating in the emergence of in-context learning.

  4. Text generation strategies including greedy decoding, temperature scaling, top-k sampling, and nucleus (top-p) sampling provide different trade-offs between quality and diversity. In practice, these are often combined.

  5. KV caching is essential for efficient autoregressive generation, eliminating redundant computation of keys and values for previously processed tokens.

  6. Modern decoder-only models (Llama, Mistral, etc.) incorporate improvements like RoPE, SwiGLU, RMSNorm, and grouped query attention, but the fundamental architecture remains a causally-masked Transformer decoder.

  7. Speculative decoding accelerates inference by using a fast draft model to propose multiple tokens, which are then verified in parallel by the target model, achieving 2-3x speedup with no quality loss.

  8. The open-source ecosystem has matured rapidly, offering a range of models from compact (Phi, 1-3B) to frontier-scale (Llama 3, 405B), each with distinct strengths for different deployment scenarios.

The decoder-only autoregressive model is arguably the most important architecture in modern AI. Its simplicity---predict the next token---belies the extraordinary depth of capability that emerges at scale. In the next chapter, we will examine how these models scale, both the theoretical scaling laws that govern their behavior and the practical considerations of training models with hundreds of billions of parameters.