

Chapter 11: Large Language Models — Architecture, Training, Fine-Tuning, RAG, and Practical Applications

"Language modeling is compression, and compression is understanding." — Ilya Sutskever, attributed (2023)


Learning Objectives

By the end of this chapter, you will be able to:

  1. Explain the architecture of decoder-only transformer LLMs and how they differ from encoder-decoder models
  2. Describe the three-stage training pipeline: pretraining, instruction tuning, and RLHF/DPO
  3. Implement parameter-efficient fine-tuning (LoRA, QLoRA) for domain adaptation
  4. Build a RAG pipeline that grounds LLM responses in retrieved documents
  5. Evaluate LLM outputs systematically (automated metrics, human evaluation, LLM-as-judge)

11.1 From Transformers to Language Models

In Chapter 10, we derived the transformer architecture from first principles: scaled dot-product attention, multi-head attention, positional encoding, and the residual stream. We implemented a complete transformer and used it to model sequential data. This chapter answers the question: what happens when you scale that architecture to billions of parameters and train it on trillions of tokens of text?

The answer is a large language model (LLM) — and its emergence as a general-purpose reasoning system is arguably the most consequential development in machine learning since the transformer itself.

What Is a Language Model?

A language model assigns probabilities to sequences of tokens. Given a sequence $x_1, x_2, \ldots, x_{t-1}$, the model estimates:

$$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

This is next-token prediction — the single training objective behind GPT, LLaMA, and every decoder-only LLM. The model learns to predict the next token given all preceding tokens, and this deceptively simple objective turns out to be sufficient for learning grammar, facts, reasoning patterns, and even rudimentary world models.

The key insight is that next-token prediction on sufficiently diverse data is implicitly a compression task. To predict the next word in a Wikipedia article about quantum mechanics, the model must encode an internal representation of quantum mechanics. To predict the next token in Python source code, it must learn Python syntax and semantics. The training objective does not explicitly require understanding — but understanding (or a functional approximation of it) emerges because it is the most efficient way to minimize the loss.

Autoregressive Generation

At inference time, an LLM generates text autoregressively: it predicts one token, appends it to the context, and predicts the next. Given a prompt $x_1, \ldots, x_n$, generation proceeds:

$$x_{n+1} \sim P_\theta(x \mid x_1, \ldots, x_n)$$

$$x_{n+2} \sim P_\theta(x \mid x_1, \ldots, x_n, x_{n+1})$$

$$\vdots$$

Each token is sampled from the model's predicted distribution. The sampling strategy — greedy (argmax), top-$k$, top-$p$ (nucleus), or temperature-scaled — controls the trade-off between coherence and diversity. We will formalize these strategies in Section 11.7.
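These strategies are simple to state in code. The sketch below (an illustrative helper, not a library function, operating on a toy logits vector) shows greedy, temperature-scaled, and top-$k$ sampling:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 0) -> int:
    """Sample one token id from a logits vector of shape (vocab_size,)."""
    if temperature == 0.0:
        return int(torch.argmax(logits))           # greedy decoding
    logits = logits / temperature                  # temperature scaling
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # keep only top-k
    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
print(sample_next_token(logits, temperature=0.0))  # greedy -> 0
```

Top-$p$ (nucleus) sampling works the same way, except the cutoff is the smallest set of tokens whose cumulative probability exceeds $p$ rather than a fixed count $k$.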

Perplexity: The Language Model's Loss

The standard evaluation metric for language models is perplexity, defined as the exponentiated cross-entropy loss:

$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right)$$

where $T$ is the sequence length and $x_{<t} = (x_1, \ldots, x_{t-1})$ denotes the tokens preceding position $t$. Lower perplexity means the model assigns higher probability to the observed text.

Modern LLMs achieve perplexities in the range of 5-15 on standard benchmarks like WikiText-103. To put this in perspective: English has roughly 50,000 common words, so a perplexity of 10 means the model has eliminated 99.98% of the vocabulary at each position. This is extraordinary compression of linguistic structure.

Fundamentals > Frontier: Perplexity connects directly to cross-entropy (Chapter 4) and information theory. The cross-entropy $H(p, q) = -\mathbb{E}_{x \sim p}[\log q(x)]$ measures how many bits the model's distribution $q$ needs to encode samples from the true distribution $p$. Perplexity = $2^{H(p,q)}$ (or $e^{H(p,q)}$ in natural log convention). Every improvement in perplexity means the model has learned to compress language more efficiently.
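These definitions can be checked with a few lines of arithmetic. The per-token probabilities below are hypothetical, chosen only to illustrate the computation:

```python
import math

# Hypothetical probabilities the model assigned to each observed token
token_probs = [0.20, 0.05, 0.50, 0.10]

# Cross-entropy in nats, then perplexity = exp(H)
H = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(H)
print(f"cross-entropy: {H:.4f} nats, perplexity: {ppl:.2f}")

# Sanity check: a uniform model over V tokens has perplexity exactly V
V = 50_000
assert abs(math.exp(-math.log(1 / V)) - V) < 1e-6
```

Perplexity is the geometric mean of the inverse token probabilities, which is why it reads as an "effective branching factor."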


11.2 Decoder-Only Transformer Architecture

Chapter 10 covered the full encoder-decoder transformer from "Attention Is All You Need" (Vaswani et al., 2017). Modern LLMs use a simplified variant: the decoder-only architecture, which consists of a stack of transformer blocks with causal (autoregressive) masking.

Why Decoder-Only?

The original transformer had two distinct components:

  • Encoder: Processes the input with bidirectional self-attention (each token attends to all other tokens).
  • Decoder: Generates the output with causal self-attention (each token attends only to previous tokens) plus cross-attention to the encoder's outputs.

For language modeling, the encoder is unnecessary. The task is simply: given tokens $x_1, \ldots, x_{t-1}$, predict $x_t$. A decoder-only model processes the entire sequence with causal self-attention — each position can attend to itself and all preceding positions, but not to future positions. This is enforced by the causal mask:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

where $M$ is an upper-triangular matrix of $-\infty$ values:

$$M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}$$

The $-\infty$ entries become 0 after softmax, preventing position $i$ from attending to any position $j > i$.
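A quick numerical check of this masking behavior, using toy scores rather than a full attention layer:

```python
import torch
import torch.nn.functional as F

T = 4
# Upper-triangular -inf mask: position i may attend only to positions j <= i
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = torch.randn(T, T)                  # toy attention scores
weights = F.softmax(scores + mask, dim=-1)  # masked softmax

# Future positions receive exactly zero attention weight...
assert torch.all(weights.triu(diagonal=1) == 0)
# ...while each row still sums to 1
assert torch.allclose(weights.sum(dim=-1), torch.ones(T))
print(weights)
```

Note that row 0 places all of its weight on position 0: the first token can attend only to itself.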

The practical advantage of decoder-only models is simplicity: one architecture serves both "understanding" (through the prompt) and "generation" (through autoregressive decoding). The prompt is processed in parallel (all positions computed simultaneously), and generation proceeds one token at a time, using the KV-cache to avoid recomputing attention for previous positions.

The GPT Architecture in Detail

The canonical decoder-only architecture (GPT-style) consists of:

  1. Token embedding: A learned lookup table $\mathbf{E} \in \mathbb{R}^{V \times d}$ mapping each token in the vocabulary ($V$ tokens) to a $d$-dimensional vector.

  2. Positional encoding: Either sinusoidal (original transformer), learned position embeddings, or rotary position embeddings (RoPE). Modern LLMs overwhelmingly use RoPE (Su et al., 2021), which encodes relative position information by rotating the query and key vectors:

$$\text{RoPE}(x_m, m) = \begin{pmatrix} x_m^{(1)} \cos(m\theta_1) - x_m^{(2)} \sin(m\theta_1) \\ x_m^{(1)} \sin(m\theta_1) + x_m^{(2)} \cos(m\theta_1) \\ \vdots \\ x_m^{(d-1)} \cos(m\theta_{d/2}) - x_m^{(d)} \sin(m\theta_{d/2}) \\ x_m^{(d-1)} \sin(m\theta_{d/2}) + x_m^{(d)} \cos(m\theta_{d/2}) \end{pmatrix}$$

where $\theta_i = 10000^{-2i/d}$ and $m$ is the position index. The critical property is that the dot product between rotated queries and keys depends only on relative position $m - n$, not absolute positions.

  3. Transformer blocks (repeated $L$ times). Each block contains:
     - Pre-layer normalization (RMSNorm in modern architectures)
     - Multi-head causal self-attention with $h$ heads, each of dimension $d_k = d/h$
     - Residual connection
     - Pre-layer normalization (again)
     - Feed-forward network — typically a gated variant: SwiGLU (Shazeer, 2020), computed as $\text{FFN}(x) = (W_1 x \odot \sigma(W_g x)) W_2$, where $\sigma$ is the SiLU activation and $\odot$ is element-wise multiplication
     - Residual connection

  4. Language model head: A linear projection $\mathbf{W}_{\text{head}} \in \mathbb{R}^{d \times V}$ mapping the final hidden state to logits over the vocabulary (often tied with the token embedding matrix: $\mathbf{W}_{\text{head}} = \mathbf{E}^T$).
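The relative-position property of RoPE described above can be verified numerically. The `rope` helper below is an illustrative pairwise-rotation implementation, not a library function:

```python
import torch

def rope(x: torch.Tensor, m: int, base: float = 10000.0) -> torch.Tensor:
    """Apply a rotary embedding at position m to a vector of even dimension."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2).float() / d)  # one angle per pair
    angles = m * theta
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(angles) - x2 * torch.sin(angles)
    out[..., 1::2] = x1 * torch.sin(angles) + x2 * torch.cos(angles)
    return out

q, k = torch.randn(8), torch.randn(8)
# <RoPE(q, m), RoPE(k, n)> depends only on the offset m - n
a = torch.dot(rope(q, 5), rope(k, 3))    # offset 2
b = torch.dot(rope(q, 12), rope(k, 10))  # offset 2
assert torch.allclose(a, b, atol=1e-5)
```

The assertion holds because rotating both vectors composes into a single rotation by $(n - m)\theta_i$ inside the dot product.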

Concrete Dimensions

To ground the architecture, consider LLaMA 2-7B:

| Component | Value |
|---|---|
| Vocabulary size $V$ | 32,000 |
| Hidden dimension $d$ | 4,096 |
| Number of layers $L$ | 32 |
| Number of attention heads $h$ | 32 |
| Head dimension $d_k$ | 128 |
| FFN intermediate dimension | 11,008 |
| Context length | 4,096 tokens |
| Total parameters | ~6.7 billion |

The parameter distribution is instructive. Each transformer block has:

  • Self-attention: $Q$, $K$, $V$ projections: $3 \times (d \times d) = 3 \times 4096^2 \approx 50$M parameters. Output projection: $d \times d \approx 16.8$M. Total: ~67M per block.
  • FFN: With SwiGLU: $3 \times (d \times d_{\text{ff}}) = 3 \times 4096 \times 11008 \approx 135$M per block.

With 32 blocks: $(67\text{M} + 135\text{M}) \times 32 \approx 6.5$B, plus embeddings ($V \times d = 32000 \times 4096 \approx 131$M).
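The same arithmetic, as a few lines of Python using the numbers from the table above:

```python
# Reproduce the LLaMA 2-7B parameter estimate
d, d_ff, V, L = 4096, 11008, 32000, 32

attn = 4 * d * d        # Q, K, V, O projections
ffn = 3 * d * d_ff      # SwiGLU: gate, up, down matrices
per_block = attn + ffn
emb = V * d

total = L * per_block + emb
print(f"per block: {per_block/1e6:.1f}M, total: {total/1e9:.2f}B")
```

This back-of-the-envelope count (~6.61B) lands close to the official ~6.7B; the small gap comes from norm weights and other minor terms omitted here.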

Production ML = Software Engineering: Understanding parameter counts is not academic trivia. Memory footprint determines which GPUs you need. In fp16, the 7B model requires $7 \times 10^9 \times 2 = 14$ GB just for parameters — before activations, optimizer states, or KV-cache. With Adam optimizer states (8 bytes per parameter) and gradients (2 bytes per parameter), training requires approximately $7 \times 10^9 \times (2 + 8 + 2) = 84$ GB — already exceeding a single A100-80GB. This is why distributed training is not optional at this scale.

Implementation: A Minimal GPT Block

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import Optional, Tuple


class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (Zhang & Sennrich, 2019).

    Simplifies LayerNorm by removing the mean-centering step.
    Empirically equivalent performance, computationally cheaper.

    Args:
        dim: Hidden dimension.
        eps: Numerical stability constant.
    """

    def __init__(self, dim: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight


class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention with KV-cache support.

    Args:
        dim: Hidden dimension.
        n_heads: Number of attention heads.
        max_seq_len: Maximum sequence length for causal mask.
    """

    def __init__(self, dim: int, n_heads: int, max_seq_len: int = 4096) -> None:
        super().__init__()
        assert dim % n_heads == 0, "dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.scale = self.head_dim ** -0.5

        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

        # Pre-compute causal mask
        mask = torch.triu(
            torch.full((max_seq_len, max_seq_len), float("-inf")),
            diagonal=1,
        )
        self.register_buffer("causal_mask", mask)

    def forward(
        self,
        x: torch.Tensor,
        kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
    ) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """Forward pass with optional KV-cache for efficient generation.

        Args:
            x: Input tensor of shape (batch, seq_len, dim).
            kv_cache: Optional tuple of cached (K, V) tensors.

        Returns:
            Output tensor and updated KV-cache.
        """
        B, T, C = x.shape

        q = self.wq(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        # Append to KV-cache if provided (for autoregressive generation)
        if kv_cache is not None:
            k_prev, v_prev = kv_cache
            k = torch.cat([k_prev, k], dim=2)
            v = torch.cat([v_prev, v], dim=2)
        new_cache = (k, v)

        # Scaled dot-product attention with causal mask
        attn_scores = (q @ k.transpose(-2, -1)) * self.scale
        seq_len_k = k.shape[2]
        seq_len_q = q.shape[2]
        # Apply causal mask for the relevant positions
        mask = self.causal_mask[seq_len_k - seq_len_q:seq_len_k, :seq_len_k]
        attn_scores = attn_scores + mask
        attn_weights = F.softmax(attn_scores, dim=-1)

        out = (attn_weights @ v).transpose(1, 2).contiguous().view(B, T, C)
        out = self.wo(out)
        return out, new_cache


class SwiGLU(nn.Module):
    """SwiGLU feed-forward network (Shazeer, 2020).

    Uses gated linear units with SiLU activation.
    The gating mechanism allows the network to learn which
    features to pass through, improving expressiveness
    over standard ReLU feed-forward layers.

    Args:
        dim: Input/output dimension.
        hidden_dim: Intermediate dimension.
    """

    def __init__(self, dim: int, hidden_dim: int) -> None:
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class TransformerBlock(nn.Module):
    """A single transformer block with pre-norm architecture.

    Args:
        dim: Hidden dimension.
        n_heads: Number of attention heads.
        ff_dim: Feed-forward intermediate dimension.
        max_seq_len: Maximum sequence length.
    """

    def __init__(
        self, dim: int, n_heads: int, ff_dim: int, max_seq_len: int = 4096
    ) -> None:
        super().__init__()
        self.attention_norm = RMSNorm(dim)
        self.attention = CausalSelfAttention(dim, n_heads, max_seq_len)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, ff_dim)

    def forward(
        self,
        x: torch.Tensor,
        kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
    ) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        # Pre-norm + attention + residual
        h, new_cache = self.attention(self.attention_norm(x), kv_cache)
        x = x + h
        # Pre-norm + FFN + residual
        x = x + self.ffn(self.ffn_norm(x))
        return x, new_cache


class MiniGPT(nn.Module):
    """Minimal GPT-style decoder-only language model.

    This implementation captures the essential architecture of
    modern LLMs (LLaMA, Mistral, etc.) without production
    optimizations (flash attention, tensor parallelism).

    Args:
        vocab_size: Size of token vocabulary.
        dim: Hidden dimension.
        n_layers: Number of transformer blocks.
        n_heads: Number of attention heads.
        ff_dim: Feed-forward intermediate dimension.
        max_seq_len: Maximum sequence length.
    """

    def __init__(
        self,
        vocab_size: int = 32000,
        dim: int = 512,
        n_layers: int = 8,
        n_heads: int = 8,
        ff_dim: int = 1376,
        max_seq_len: int = 2048,
    ) -> None:
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList(
            [
                TransformerBlock(dim, n_heads, ff_dim, max_seq_len)
                for _ in range(n_layers)
            ]
        )
        self.norm = RMSNorm(dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)

        # Weight tying: share embedding and output projection
        self.lm_head.weight = self.tok_emb.weight

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module: nn.Module) -> None:
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(
        self,
        tokens: torch.Tensor,
        kv_caches: Optional[list] = None,
    ) -> Tuple[torch.Tensor, list]:
        """Forward pass returning logits and updated KV-caches.

        Args:
            tokens: Token indices of shape (batch, seq_len).
            kv_caches: List of per-layer KV-caches for generation.

        Returns:
            Logits of shape (batch, seq_len, vocab_size) and
            updated KV-caches.
        """
        x = self.tok_emb(tokens)
        new_caches = []

        for i, layer in enumerate(self.layers):
            cache = kv_caches[i] if kv_caches is not None else None
            x, new_cache = layer(x, cache)
            new_caches.append(new_cache)

        x = self.norm(x)
        logits = self.lm_head(x)
        return logits, new_caches

    def count_parameters(self) -> int:
        """Return total number of trainable parameters."""
        return sum(p.numel() for p in self.parameters() if p.requires_grad)


# Verify the architecture
model = MiniGPT(vocab_size=32000, dim=512, n_layers=8, n_heads=8)
print(f"Parameters: {model.count_parameters():,}")

tokens = torch.randint(0, 32000, (2, 128))
logits, caches = model(tokens)
print(f"Input shape:  {tokens.shape}")
print(f"Output shape: {logits.shape}")
Parameters: 41,689,600
Input shape:  torch.Size([2, 128])
Output shape: torch.Size([2, 128, 32000])

11.3 Tokenization: How Text Becomes Numbers

Before any text reaches the transformer, it must be converted to a sequence of integer token IDs. This is the job of the tokenizer, and the choice of tokenization strategy has surprisingly large effects on model behavior.

Byte-Pair Encoding (BPE)

Most modern LLMs use byte-pair encoding (Sennrich et al., 2016), which builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens:

  1. Start with a vocabulary of individual characters (or bytes).
  2. Count all adjacent pairs in the training corpus.
  3. Merge the most frequent pair into a new token.
  4. Repeat until the vocabulary reaches the target size (typically 32K-128K tokens).

The result is a vocabulary that represents common words as single tokens ("the" $\to$ [the]), rare words as sequences of subword tokens ("autoregressively" $\to$ [auto, reg, ress, ively]), and never encounters an out-of-vocabulary word (because it can always fall back to individual characters or bytes).
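The merge loop described above can be sketched on a toy corpus. This is a deliberately simplified trainer; production BPE implementations add pre-tokenization, byte fallback, and efficient pair counting:

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges on a toy corpus of whitespace-split words."""
    words = Counter(tuple(w) for w in corpus)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in words.items():       # rewrite every word with the merge
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
print(bpe_train(corpus, 3))  # first merges: ('l', 'o'), then ('lo', 'w')
```

After a few merges, frequent substrings like "low" become single symbols, which is exactly how common words end up as single tokens in a trained vocabulary.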

SentencePiece

SentencePiece (Kudo and Richardson, 2018) extends BPE by operating on raw text without pre-tokenization — it treats whitespace as an ordinary symbol (rendered as '▁') and learns subword segmentation directly from the character stream. This makes it language-agnostic and avoids the need for language-specific word boundary rules. LLaMA, Mistral, and many multilingual models use SentencePiece.

Why Tokenization Matters

Tokenization has concrete consequences:

  1. Cost. LLM APIs charge per token. An inefficient tokenizer means more tokens per document, which means higher cost and more computation.
  2. Context window. The same text consumes different numbers of tokens under different tokenizers. GPT-4's tokenizer represents English at roughly 0.75 tokens per word; less efficient tokenizers for non-Latin scripts can use 3-5x more tokens for the same semantic content.
  3. Arithmetic and reasoning. Tokenizers often split numbers in arbitrary ways ("12345" $\to$ ["123", "45"]), which makes arithmetic harder for the model — it must learn to compose multi-token numbers rather than operate on them atomically.
from transformers import AutoTokenizer

# Compare tokenization across models
tokenizer_llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")

text = "The autoregressive transformer generates tokens one at a time."
llama_tokens = tokenizer_llama.tokenize(text)
gpt2_tokens = tokenizer_gpt2.tokenize(text)

print(f"LLaMA tokens ({len(llama_tokens)}): {llama_tokens}")
print(f"GPT-2 tokens ({len(gpt2_tokens)}): {gpt2_tokens}")
LLaMA tokens (12): ['The', '▁aut', 'ore', 'gressive', '▁transformer', '▁generates', '▁tokens', '▁one', '▁at', '▁a', '▁time', '.']
GPT-2 tokens (10): ['The', 'Ġautoregressive', 'Ġtransformer', 'Ġgenerates', 'Ġtokens', 'Ġone', 'Ġat', 'Ġa', 'Ġtime', '.']

11.4 The Three-Stage Training Pipeline

Modern LLMs are not trained in a single step. The training pipeline has three distinct stages, each with different data, objectives, and goals.

Stage 1: Pretraining

Pretraining trains the model on massive unlabeled text with the next-token prediction objective:

$$\mathcal{L}_{\text{pretrain}} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$

The data is vast and diverse: web crawls (Common Crawl, refined via quality filtering), books, Wikipedia, code repositories, scientific papers. LLaMA 2 was trained on approximately 2 trillion tokens. GPT-4 is believed to have used significantly more.

The key parameters for pretraining are:

| Parameter | Typical Range |
|---|---|
| Dataset size | 1-15 trillion tokens |
| Model size | 1B-540B parameters |
| Learning rate | $1 \times 10^{-4}$ to $3 \times 10^{-4}$ (with warmup + cosine decay) |
| Batch size | 2M-4M tokens |
| Training duration | Weeks to months on thousands of GPUs |

Scaling Laws and the Chinchilla Result

How should you allocate a fixed compute budget between model size and data? Kaplan et al. (2020) proposed power-law scaling relationships, but the landmark result is from Hoffmann et al. (2022) — the Chinchilla paper.

Hoffmann et al. trained over 400 models ranging from 70M to 16B parameters on varying amounts of data and found that for a given compute budget $C$, the optimal allocation is:

$$N_{\text{opt}} \propto C^{0.50}, \qquad D_{\text{opt}} \propto C^{0.50}$$

where $N$ is the number of parameters and $D$ is the number of training tokens. The critical implication: model size and data size should be scaled equally. Specifically, the optimal ratio is approximately 20 tokens per parameter.

This meant that many existing models were undertrained. Chinchilla (70B parameters, 1.4 trillion tokens) matched the performance of Gopher (280B parameters, 300 billion tokens) despite being 4x smaller — because Gopher had received only ~1 token per parameter, far below the optimal ~20.

The practical consequences of scaling laws are profound:

  1. For a fixed compute budget, a smaller model trained on more data often outperforms a larger model trained on less data. This directly affects hardware procurement decisions.
  2. Inference cost scales linearly with model size but is independent of training data size. A Chinchilla-optimal 70B model is 4x cheaper to serve than a 280B model with the same performance.
  3. Data quality becomes the binding constraint once you are scaling data to meet the Chinchilla ratio. This has driven massive investment in data curation.
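Under the widely used approximation $C \approx 6ND$ training FLOPs, the Chinchilla allocation reduces to a one-line calculation. The compute budget below is illustrative, chosen to roughly match Chinchilla's own scale:

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a compute budget C ~= 6*N*D using the Chinchilla ratio D = 20*N."""
    N = math.sqrt(compute_flops / 120)  # solve 6 * N * (20 * N) = C
    D = 20 * N
    return N, D

N, D = chinchilla_optimal(5.76e23)
print(f"params ~{N/1e9:.0f}B, tokens ~{D/1e12:.1f}T")
```

For this budget the calculation returns roughly 70B parameters and 1.4T tokens, matching the Chinchilla configuration described above.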

Fundamentals > Frontier: Scaling laws are empirical power laws, not theorems. They extrapolate within the range of experiments but may not hold indefinitely. The LLaMA 3 family (Meta, 2024) trained an 8B model on 15 trillion tokens — a ratio of nearly 1,900 tokens per parameter, far exceeding the Chinchilla-optimal ~20. This "over-training" makes the model worse per compute dollar during training but dramatically cheaper at inference. The frontier has shifted from "minimize training loss for fixed compute" to "minimize the total cost of ownership across training and inference."

Stage 2: Supervised Fine-Tuning (SFT) / Instruction Tuning

A pretrained LLM is a powerful text completer — but it is not a useful assistant. Given the prompt "What is the capital of France?", a pretrained model might continue with "What is the capital of Germany?" (because the training data contains lists of questions) rather than answering "Paris."

Supervised fine-tuning (SFT) teaches the model to follow instructions by training on curated (prompt, response) pairs:

$$\mathcal{L}_{\text{SFT}} = -\frac{1}{T_{\text{response}}}\sum_{t \in \text{response}} \log P_\theta(x_t \mid x_{<t})$$

Note that the loss is computed only over the response tokens, not the prompt. The prompt provides context, but we do not want to "penalize" the model for the prompt's content.
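One common way to implement this response-only loss is to overwrite prompt positions with the `ignore_index` recognized by PyTorch's cross-entropy. A minimal sketch with random logits and illustrative labels:

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # the default ignore_index of F.cross_entropy

vocab, prompt_len = 10, 3
logits = torch.randn(1, 6, vocab)             # model outputs for 6 positions
targets = torch.tensor([[4, 1, 7, 2, 9, 3]])  # next-token labels

labels = targets.clone()
labels[:, :prompt_len] = IGNORE               # prompt positions carry no loss

loss = F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=IGNORE)

# Identical to averaging the loss over the response tokens only
manual = F.cross_entropy(
    logits[:, prompt_len:].reshape(-1, vocab), targets[:, prompt_len:].reshape(-1)
)
assert torch.allclose(loss, manual)
```

Ignored positions are excluded from both the sum and the denominator, so the result is exactly the mean loss over response tokens.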

The SFT dataset is much smaller than the pretraining data — typically 10,000 to 100,000 high-quality examples — but each example is carefully crafted. InstructGPT used approximately 13,000 demonstration examples written by human contractors. Open-source datasets like FLAN, Dolly, and Open Assistant provide instruction-tuning data at various quality levels.

Stage 3: Alignment — RLHF and DPO

SFT produces a model that can follow instructions, but it may still generate harmful, untruthful, or unhelpful outputs. The third stage aligns the model with human preferences.

Reinforcement Learning from Human Feedback (RLHF)

RLHF (Ouyang et al., 2022) involves three substeps:

  1. Collect preference data. For a given prompt, the SFT model generates multiple responses. Human annotators rank them (or compare pairs): response A is better than response B.

  2. Train a reward model. A separate model $R_\phi(x, y)$ is trained to predict human preferences. Given a prompt $x$ and two responses $y_w$ (preferred) and $y_l$ (dispreferred), the training objective is:

$$\mathcal{L}_{\text{RM}} = -\log \sigma\left(R_\phi(x, y_w) - R_\phi(x, y_l)\right)$$

This is the Bradley-Terry model from Chapter 3 — the probability that a human prefers $y_w$ over $y_l$ is modeled as a logistic function of the reward difference.

  3. Optimize the policy with PPO. The LLM is fine-tuned to maximize the reward model's score while staying close to the SFT model (to prevent reward hacking):

$$\mathcal{L}_{\text{RLHF}} = -\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\left[R_\phi(x, y) - \beta \cdot D_{\text{KL}}[\pi_\theta(y \mid x) \| \pi_{\text{SFT}}(y \mid x)]\right]$$

The KL penalty $\beta \cdot D_{\text{KL}}$ is critical: without it, the model would exploit artifacts in the reward model (generating text that scores high with the reward model but is nonsensical to humans). This is the reward hacking problem.

Direct Preference Optimization (DPO)

DPO (Rafailov et al., 2023) simplifies RLHF by eliminating the explicit reward model. The key insight is that the optimal policy under the RLHF objective has a closed-form relationship to the reward function:

$$r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Substituting this into the Bradley-Terry preference model yields a loss that can be optimized directly on the policy:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

DPO requires only paired preference data — no reward model, no RL loop, no PPO hyperparameter tuning. In practice, DPO achieves comparable results to RLHF with significantly less infrastructure complexity, which is why it has become the default alignment method for most open-source projects.
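The DPO loss above is straightforward to write down. The sketch below assumes the per-response log-probabilities have already been summed over tokens; the tensor values are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_w: torch.Tensor,
    policy_logp_l: torch.Tensor,
    ref_logp_w: torch.Tensor,
    ref_logp_l: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO loss from summed log-probs of chosen (w) and rejected (l) responses."""
    # Implicit reward margin: beta * [log(pi/ref) for y_w minus the same for y_l]
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy values: the policy favors y_w more strongly than the reference does
pw, pl = torch.tensor([-10.0]), torch.tensor([-14.0])
rw, rl = torch.tensor([-12.0]), torch.tensor([-13.0])
loss = dpo_loss(pw, pl, rw, rl)
print(f"{float(loss):.4f}")  # below log 2 ~= 0.6931, since the margin is positive
```

At zero margin the loss equals $\log 2$; a positive margin (policy agrees with the preference more than the reference) drives it lower.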

The three-stage pipeline is summarized:

| Stage | Data | Objective | Scale |
|---|---|---|---|
| Pretraining | Trillions of tokens (web, books, code) | Next-token prediction | Months, thousands of GPUs |
| SFT | 10K-100K curated instruction pairs | Next-token prediction (response only) | Hours, 8-64 GPUs |
| RLHF/DPO | 10K-100K preference pairs | Preference alignment | Hours to days, 8-64 GPUs |

11.5 Parameter-Efficient Fine-Tuning: LoRA

Full fine-tuning of a 7B-parameter model requires storing and updating all 7 billion parameters — including optimizer states, this can require 80+ GB of GPU memory. For most practitioners, this is impractical. Parameter-efficient fine-tuning (PEFT) methods adapt the model by training only a small number of additional parameters.

The Low-Rank Hypothesis

The key insight behind LoRA (Hu et al., 2022) is that the weight updates during fine-tuning are low-rank. That is, the change in a weight matrix $\Delta W$ during fine-tuning can be well-approximated by a low-rank factorization:

$$\Delta W \approx BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$ is the rank. Instead of training the full $d \times d$ matrix $W$ (which has $d^2$ parameters), we train $A$ and $B$ (which have $2dr$ parameters). For $d = 4096$ and $r = 16$: $d^2 = 16.8$M parameters vs. $2dr = 131$K — a reduction of $128\times$.

The LoRA Forward Pass

During fine-tuning, the forward pass for a LoRA-adapted linear layer is:

$$h = W_0 x + \frac{\alpha}{r} \cdot B A x$$

where:

  • $W_0 \in \mathbb{R}^{d \times d}$ is the pretrained (frozen) weight matrix
  • $A \in \mathbb{R}^{r \times d}$ is initialized from a random Gaussian
  • $B \in \mathbb{R}^{d \times r}$ is initialized to zero (so $\Delta W = BA = 0$ at the start of training, and the model begins from the pretrained weights)
  • $\alpha$ is a scaling hyperparameter that controls the magnitude of the update
  • The ratio $\alpha / r$ normalizes the contribution so that changing $r$ does not require re-tuning $\alpha$
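A minimal LoRA-adapted linear layer following these initialization choices (an illustrative sketch, not the `peft` library's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0) -> None:
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)              # freeze W_0
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init -> ΔW = 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64, bias=False), r=4)
x = torch.randn(2, 64)
# At initialization the adapter contributes nothing: output equals the base layer
assert torch.allclose(layer(x), layer.base(x))
```

Only `A` and `B` receive gradients, so the optimizer state scales with $2dr$ rather than $d^2$.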

Why Does Low-Rank Work?

The theoretical justification comes from two observations:

  1. Intrinsic dimensionality. Aghajanyan et al. (2021) showed that fine-tuning objectives have a low intrinsic dimensionality — the loss landscape can be explored effectively in a much lower-dimensional subspace than the full parameter space. For a 350M-parameter model, they found the intrinsic dimension to be as low as ~200.

  2. Spectral structure of $\Delta W$. When you fine-tune a pretrained model on a domain-specific task, most of the model's knowledge is preserved — only a small task-specific adjustment is needed. Hu et al. empirically showed that the singular values of $\Delta W$ decay rapidly, confirming that the update is indeed low-rank.

Implementation with PEFT

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load a pretrained model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                       # Rank of the low-rank matrices
    lora_alpha=32,              # Scaling factor (alpha/r = 2.0)
    lora_dropout=0.05,          # Dropout on LoRA layers
    target_modules=[            # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",      # FFN
    ],
    bias="none",                # Do not train bias terms
)

# Create the PEFT model
peft_model = get_peft_model(model, lora_config)

# Compare parameter counts
total_params = sum(p.numel() for p in peft_model.parameters())
trainable_params = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
print(f"Total parameters:     {total_params:>14,}")
print(f"Trainable parameters: {trainable_params:>14,}")
print(f"Trainable fraction:   {trainable_params / total_params:.4%}")
Total parameters:      6,778,392,576
Trainable parameters:     39,976,960
Trainable fraction:   0.5898%

QLoRA: LoRA on Quantized Models

QLoRA (Dettmers et al., 2023) combines LoRA with aggressive model quantization, enabling fine-tuning of 65B-parameter models on a single 48GB GPU. The three key innovations:

  1. 4-bit NormalFloat (NF4) quantization. The pretrained weights are quantized to 4 bits using a data type optimized for normally distributed weights. Each parameter uses 4 bits instead of 16 — a $4\times$ memory reduction.

  2. Double quantization. The quantization constants themselves are quantized, saving an additional ~0.4 bits per parameter.

  3. Paged optimizers. Optimizer states are offloaded to CPU memory when GPU memory is exhausted, using unified memory page management.

The workflow is: load the model in 4-bit precision, freeze all quantized weights, add LoRA adapters in fp16/bf16, and train only the LoRA parameters. At inference time, the LoRA weights can be merged back into the base model with no additional latency:

$$W_{\text{merged}} = W_0 + \frac{\alpha}{r} \cdot BA$$
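The merge is just a matrix addition, which a small NumPy sketch can verify (shapes, names, and random values here are illustrative, not the PEFT implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 16, 32

W0 = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # LoRA down-projection (trained)
B = rng.standard_normal((d_out, r)) * 0.01   # LoRA up-projection (trained)

# Merging folds the adapter into the base weight: no extra inference latency
W_merged = W0 + (alpha / r) * (B @ A)

# The merged matrix computes exactly what the base + adapter path computed
x = rng.standard_normal(d_in)
adapter_path = W0 @ x + (alpha / r) * (B @ (A @ x))
print(np.allclose(adapter_path, W_merged @ x))  # True
```

The same arithmetic run in reverse (subtracting $\frac{\alpha}{r} BA$) un-merges the adapter, which is why a single base model can serve many merged-on-demand LoRA adapters.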

Quantization: INT8, INT4, GPTQ, and AWQ

Quantization reduces the numerical precision of model weights to decrease memory footprint and increase inference speed:

| Method | Precision | Memory (7B model) | Notes |
|---|---|---|---|
| fp32 | 32-bit float | ~28 GB | Training baseline |
| fp16/bf16 | 16-bit float | ~14 GB | Standard inference |
| INT8 | 8-bit integer | ~7 GB | bitsandbytes LLM.int8() |
| INT4 (NF4) | 4-bit | ~3.5 GB | QLoRA/bitsandbytes |
| GPTQ | 3–4 bit | ~3–4 GB | Post-training quantization via second-order info |
| AWQ | 4-bit | ~3.5 GB | Activation-aware weight quantization |

GPTQ (Frantar et al., 2023) uses approximate second-order information (the Hessian of the layer-wise reconstruction error) to find quantization assignments that minimize output degradation. It is a post-training method: quantize once, then deploy.

AWQ (Lin et al., 2024) observes that a small fraction of weights (corresponding to high-magnitude activations) are disproportionately important. It protects these salient weights by scaling them up before quantization and scaling the activations down, achieving better accuracy than naive uniform quantization.
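To make the basic mechanism concrete, here is a sketch of symmetric absmax INT8 quantization in NumPy — the simplest of these schemes. GPTQ and AWQ refine *how* the rounding error is distributed across weights, but the quantize/dequantize round-trip has the same shape (function names here are illustrative):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric absmax quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

# 4x smaller than fp32, at the cost of a bounded rounding error (<= scale / 2)
print(f"max abs error: {np.abs(w - w_hat).max():.5f}")
```

In practice the scale is computed per block of 64–256 weights rather than per tensor, which keeps outlier weights from inflating the error everywhere else — the problem NF4, GPTQ, and AWQ each attack in their own way.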


11.6 Retrieval-Augmented Generation (RAG)

LLMs have a fundamental limitation: their knowledge is frozen at training time. They cannot access current information, proprietary data, or domain-specific documents unless that information was in the training corpus. Retrieval-augmented generation (RAG) addresses this by retrieving relevant documents at inference time and injecting them into the prompt.

The RAG Pipeline

A RAG system has four stages:

  1. Indexing. Documents are split into chunks, each chunk is embedded using an embedding model, and the embeddings are stored in a vector database.
  2. Retrieval. Given a user query, embed the query with the same model and retrieve the $k$ most similar chunks via approximate nearest neighbor (ANN) search.
  3. Augmentation. Construct a prompt that includes the retrieved chunks as context.
  4. Generation. The LLM generates a response grounded in the retrieved context.

Chunking Strategies

How you split documents into chunks significantly affects retrieval quality:

  • Fixed-size chunking. Split every $n$ tokens (e.g., 512 tokens) with overlap (e.g., 50 tokens). Simple but may split sentences or paragraphs mid-thought.
  • Recursive character splitting. Split on paragraph boundaries, then sentence boundaries, then character counts as a fallback. Preserves semantic units.
  • Semantic chunking. Use an embedding model to detect topic boundaries — split when the embedding similarity between adjacent sentences drops below a threshold.

The chunk size trade-off: smaller chunks are more precise (the retrieved chunk is more likely to contain exactly the needed information) but lose broader context. Larger chunks provide more context but may contain irrelevant information that dilutes the signal.

Implementation: A Complete RAG Pipeline

import numpy as np
from typing import List, Dict, Tuple
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document with text content and metadata."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)
    doc_id: str = ""


@dataclass
class Chunk:
    """A chunk of a document with its embedding."""
    text: str
    doc_id: str
    chunk_id: int
    metadata: Dict[str, str] = field(default_factory=dict)
    embedding: np.ndarray = field(default_factory=lambda: np.array([]))


def chunk_documents(
    documents: List[Document],
    chunk_size: int = 512,
    chunk_overlap: int = 50,
) -> List[Chunk]:
    """Split documents into overlapping chunks.

    Uses a simple token-count approach. In production, prefer
    recursive splitting that respects sentence and paragraph
    boundaries (e.g., LangChain's RecursiveCharacterTextSplitter).

    Args:
        documents: List of documents to chunk.
        chunk_size: Target chunk size in tokens (approximated by words).
        chunk_overlap: Number of overlapping tokens between chunks.

    Returns:
        List of Chunk objects.
    """
    chunks = []
    for doc in documents:
        words = doc.text.split()
        start = 0
        chunk_id = 0
        while start < len(words):
            end = min(start + chunk_size, len(words))
            chunk_text = " ".join(words[start:end])
            chunks.append(Chunk(
                text=chunk_text,
                doc_id=doc.doc_id,
                chunk_id=chunk_id,
                metadata=doc.metadata,
            ))
            chunk_id += 1
            start += chunk_size - chunk_overlap
    return chunks


class SimpleEmbedder:
    """Embedding model wrapper.

    In production, use a dedicated embedding model like:
    - sentence-transformers/all-MiniLM-L6-v2 (fast, 384-dim)
    - BAAI/bge-large-en-v1.5 (accurate, 1024-dim)
    - OpenAI text-embedding-3-small (API-based, 1536-dim)

    This demo uses random projections to illustrate the pipeline.
    """

    def __init__(self, dim: int = 384, seed: int = 42) -> None:
        self.dim = dim
        self.rng = np.random.RandomState(seed)
        # Simulate: project bag-of-words to dense space
        self.vocab: Dict[str, int] = {}
        self.projection: np.ndarray = np.array([])

    def _build_vocab(self, texts: List[str]) -> None:
        """Build vocabulary from corpus."""
        for text in texts:
            for word in text.lower().split():
                if word not in self.vocab:
                    self.vocab[word] = len(self.vocab)
        vocab_size = len(self.vocab)
        self.projection = self.rng.randn(vocab_size, self.dim) / np.sqrt(self.dim)

    def encode(self, texts: List[str]) -> np.ndarray:
        """Encode texts to dense vectors via random projection of BoW.

        Args:
            texts: List of text strings.

        Returns:
            Embeddings array of shape (len(texts), dim).
        """
        if len(self.vocab) == 0:
            self._build_vocab(texts)

        embeddings = np.zeros((len(texts), self.dim))
        for i, text in enumerate(texts):
            bow = np.zeros(len(self.vocab))
            for word in text.lower().split():
                if word in self.vocab:
                    bow[self.vocab[word]] += 1
            if np.sum(bow) > 0:
                bow = bow / np.sum(bow)  # Normalize
            embeddings[i] = bow @ self.projection

        # L2-normalize for cosine similarity
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        norms = np.maximum(norms, 1e-8)
        embeddings = embeddings / norms
        return embeddings


class VectorStore:
    """Simple in-memory vector store using brute-force cosine similarity.

    In production, use a dedicated vector database:
    - FAISS (Meta) — local, fast, supports IVF/HNSW indexing
    - ChromaDB — lightweight, Python-native
    - Pinecone — managed service, scales automatically
    - Weaviate — open-source, hybrid search (vector + keyword)

    Args:
        dim: Embedding dimension.
    """

    def __init__(self, dim: int = 384) -> None:
        self.dim = dim
        self.embeddings: np.ndarray = np.array([]).reshape(0, dim)
        self.chunks: List[Chunk] = []

    def add(self, chunks: List[Chunk], embeddings: np.ndarray) -> None:
        """Add chunks and their embeddings to the store.

        Args:
            chunks: List of Chunk objects.
            embeddings: Embeddings array of shape (n_chunks, dim).
        """
        if self.embeddings.shape[0] == 0:
            self.embeddings = embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, embeddings])
        self.chunks.extend(chunks)

    def search(self, query_embedding: np.ndarray, top_k: int = 5) -> List[Tuple[Chunk, float]]:
        """Retrieve the top-k most similar chunks.

        Args:
            query_embedding: Query vector of shape (dim,).
            top_k: Number of results to return.

        Returns:
            List of (Chunk, similarity_score) tuples.
        """
        similarities = self.embeddings @ query_embedding
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [(self.chunks[i], float(similarities[i])) for i in top_indices]


def build_rag_prompt(
    query: str,
    retrieved_chunks: List[Tuple[Chunk, float]],
    system_prompt: str = "You are a helpful assistant. Answer the question based on the provided context. If the context does not contain enough information, say so.",
) -> str:
    """Construct a RAG prompt from query and retrieved context.

    Args:
        query: User's natural language question.
        retrieved_chunks: Retrieved (chunk, score) pairs.
        system_prompt: System-level instructions.

    Returns:
        Formatted prompt string.
    """
    context_parts = []
    for i, (chunk, score) in enumerate(retrieved_chunks, 1):
        source = chunk.metadata.get("source", "unknown")
        context_parts.append(
            f"[Source {i}: {source} (relevance: {score:.3f})]\n{chunk.text}"
        )
    context = "\n\n".join(context_parts)

    prompt = f"""{system_prompt}

Context:
{context}

Question: {query}

Answer:"""
    return prompt


# --- Demo: RAG Pipeline End-to-End ---

# 1. Create documents (simulating a StreamRec content catalog)
documents = [
    Document(
        text="Planet Earth III is a nature documentary series narrated by "
             "David Attenborough. Released in 2023, it explores diverse "
             "ecosystems including volcanic islands, freshwater habitats, "
             "forests, and ocean depths. The series uses cutting-edge camera "
             "technology including drones and deep-sea submersibles to capture "
             "never-before-seen animal behaviors.",
        metadata={"source": "catalog", "category": "documentary", "year": "2023"},
        doc_id="doc_001",
    ),
    Document(
        text="Chasing Ice is a 2012 documentary following photographer James "
             "Balog and his Extreme Ice Survey. The film documents retreating "
             "glaciers through time-lapse photography, providing visual evidence "
             "of climate change. It won the Excellence in Cinematography Award "
             "at Sundance.",
        metadata={"source": "catalog", "category": "documentary", "year": "2012"},
        doc_id="doc_002",
    ),
    Document(
        text="The social network comedy Silicon Valley follows programmer "
             "Richard Hendricks and his startup Pied Piper as they navigate "
             "the tech industry. Created by Mike Judge, the series ran for "
             "six seasons on HBO from 2014 to 2019.",
        metadata={"source": "catalog", "category": "comedy", "year": "2014"},
        doc_id="doc_003",
    ),
    Document(
        text="An Inconvenient Truth is a 2006 documentary featuring Al Gore's "
             "campaign to educate people about global warming. The film presents "
             "scientific data on climate change and its projected consequences, "
             "winning the Academy Award for Best Documentary Feature.",
        metadata={"source": "catalog", "category": "documentary", "year": "2006"},
        doc_id="doc_004",
    ),
]

# 2. Chunk and embed
chunks = chunk_documents(documents, chunk_size=100, chunk_overlap=20)
embedder = SimpleEmbedder(dim=128)
all_texts = [c.text for c in chunks]
embeddings = embedder.encode(all_texts)

# 3. Build vector store
store = VectorStore(dim=128)
store.add(chunks, embeddings)

# 4. Query
query = "documentaries about climate change"
query_embedding = embedder.encode([query])[0]
results = store.search(query_embedding, top_k=3)

# 5. Build prompt
prompt = build_rag_prompt(query, results)
print("=== RAG Prompt ===")
print(prompt[:800])
print("...")
print(f"\nRetrieved {len(results)} chunks")
for chunk, score in results:
    print(f"  [{score:.3f}] {chunk.doc_id}: {chunk.text[:60]}...")
=== RAG Prompt ===
You are a helpful assistant. Answer the question based on the provided context. If the context does not contain enough information, say so.

Context:
[Source 1: catalog (relevance: 0.412)]
An Inconvenient Truth is a 2006 documentary featuring Al Gore's campaign to educate people about global warming...

[Source 2: catalog (relevance: 0.387)]
Chasing Ice is a 2012 documentary following photographer James Balog and his Extreme Ice Survey...

[Source 3: catalog (relevance: 0.201)]
Planet Earth III is a nature documentary series narrated by David Attenborough...

Question: documentaries about climate change

Answer:
...

Retrieved 3 chunks
  [0.412] doc_004: An Inconvenient Truth is a 2006 documentary featuring Al Go...
  [0.387] doc_002: Chasing Ice is a 2012 documentary following photographer Ja...
  [0.201] doc_001: Planet Earth III is a nature documentary series narrated by ...

RAG Design Decisions

Building a production RAG system requires navigating several design trade-offs:

Embedding model selection. The embedding model determines retrieval quality. Dedicated embedding models (BGE, GTE, E5) consistently outperform using LLM embeddings directly, because they are trained specifically for the retrieval objective (contrastive learning with hard negatives). The MTEB leaderboard provides standardized comparisons.

Chunk size. Empirically, chunks of 256-512 tokens work well for most applications. Smaller chunks (128 tokens) improve precision but lose paragraph-level context. Larger chunks (1024+ tokens) provide more context but may dilute relevance.

Retrieval count ($k$). More retrieved chunks provide more context but consume more of the context window and increase the chance of including irrelevant information. A common pattern is to retrieve $k=10-20$ chunks, then rerank with a cross-encoder to select the top 3-5.

Hybrid search. Combining dense retrieval (vector similarity) with sparse retrieval (BM25/keyword matching) often outperforms either alone. Dense retrieval captures semantic similarity; sparse retrieval captures exact keyword matches that embeddings may miss.
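One simple way to combine a dense ranking with a sparse ranking is reciprocal rank fusion (RRF). The sketch below is illustrative — the document IDs are hypothetical orderings, and the constant $k = 60$ is a conventional default, not a library API:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse rankings: score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_004", "doc_002", "doc_001"]    # order by embedding similarity
sparse = ["doc_002", "doc_003", "doc_004"]   # order by BM25 keyword score
print(reciprocal_rank_fusion([dense, sparse]))
# doc_002 wins: appearing near the top of both lists beats first place in one
```

RRF only needs ranks, not scores, so it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.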

Production ML = Software Engineering: A production RAG system is not just a pipeline — it is a distributed system with failure modes. The vector database can return stale embeddings if the indexing pipeline lags behind document updates. The embedding model and the LLM can drift out of alignment if one is updated and the other is not. Chunk boundaries can split critical information across two chunks, neither of which is retrieved. These are engineering problems, not ML problems, and they require engineering solutions: monitoring, testing, and graceful degradation.


11.7 Prompt Engineering and Decoding Strategies

Prompt Engineering

The way you structure the input to an LLM dramatically affects output quality. Three foundational techniques:

Zero-shot prompting. Describe the task directly: "Classify the following review as positive or negative: ..." The model relies entirely on its pretrained knowledge.

Few-shot prompting. Provide examples in the prompt:

Review: "This movie was fantastic, I loved every moment." → Positive
Review: "Terrible acting and a boring plot." → Negative
Review: "The cinematography was breathtaking." → ?

Few-shot prompting leverages in-context learning — the model infers the pattern from the examples without any parameter updates. Brown et al. (2020) showed that GPT-3's few-shot performance often matches fine-tuned models on standard benchmarks.

Chain-of-thought (CoT) prompting. Instruct the model to reason step by step before producing an answer: "Let's think step by step." Wei et al. (2022) showed that CoT dramatically improves performance on arithmetic, commonsense reasoning, and symbolic reasoning tasks. The mechanism is straightforward: intermediate reasoning tokens allow the model to decompose complex problems and perform sequential computation that would be impossible in a single forward pass.

Decoding Strategies

How tokens are sampled from the model's predicted distribution:

Greedy decoding. Always select the highest-probability token: $x_t = \arg\max_x P_\theta(x \mid x_{<t})$. Deterministic and reproducible, but prone to repetitive, degenerate text in open-ended generation.

Temperature scaling. Divide logits by temperature $\tau$ before softmax:

$$P(x_t = v) = \frac{\exp(z_v / \tau)}{\sum_{v'} \exp(z_{v'} / \tau)}$$

$\tau < 1$ sharpens the distribution (more deterministic); $\tau > 1$ flattens it (more diverse). $\tau \to 0$ recovers greedy decoding.

Top-$k$ sampling. Sample only from the $k$ highest-probability tokens, redistributing probability mass among them. $k = 1$ is greedy; $k = V$ is unrestricted sampling.

Top-$p$ (nucleus) sampling (Holtzman et al., 2020). Sample from the smallest set of tokens whose cumulative probability exceeds $p$:

$$S_p = \underset{S \subseteq \mathcal{V}}{\arg\min}\; |S| \quad \text{subject to} \quad \sum_{v \in S} P(v) \geq p$$

This adapts the number of candidates to the model's confidence: when the model is confident (one token has probability 0.95), nucleus sampling effectively becomes greedy. When the model is uncertain (probability spread across many tokens), it allows more diversity.
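The three strategies compose naturally — temperature first, then top-$k$ and top-$p$ filtering over the sorted distribution. A minimal NumPy sketch (the function name and toy logits are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Apply temperature, then top-k / top-p filtering, then sample one token id."""
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    sorted_probs = probs[order]
    keep = np.ones_like(sorted_probs, dtype=bool)
    if top_k > 0:
        keep[top_k:] = False                   # keep only the k most likely tokens
    if top_p < 1.0:
        cum = np.cumsum(sorted_probs)
        cutoff = np.searchsorted(cum, top_p) + 1  # smallest prefix with mass >= p
        keep[cutoff:] = False
    sorted_probs = np.where(keep, sorted_probs, 0.0)
    sorted_probs /= sorted_probs.sum()         # renormalize the surviving mass
    return int(order[rng.choice(len(order), p=sorted_probs)])

logits = np.array([4.0, 3.0, 1.0, 0.5, 0.1])
print(sample(logits, top_k=1))                 # k = 1 recovers greedy: token 0
print(sample(logits, temperature=0.7, top_p=0.9))
```

Note how $\tau \to 0$ and $k = 1$ both collapse to greedy decoding, matching the limits described above.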


11.8 Hallucination: A Fundamental Property

LLMs hallucinate — they generate text that is fluent, confident, and wrong. This is not a bug to be patched; it is a fundamental consequence of the training objective.

Why LLMs Hallucinate

A language model is trained to produce plausible continuations of text. "Plausible" and "true" are different properties. Consider: the model has learned that sentences like "The capital of [country] is [city]" are common patterns. When asked about a real country, it retrieves the correct city from its parameters. When asked about a fictional country, it generates a plausible-sounding city — because the distributional pattern is the same.

More precisely, the model minimizes:

$$\mathcal{L} = -\sum_{t} \log P_\theta(x_t \mid x_{<t})$$

This objective rewards high probability for the training data, not truthfulness. If the training data contains conflicting information (which internet text inevitably does), the model learns a distribution over possible continuations that may assign nonzero probability to false statements.

Mitigating Hallucination

No method eliminates hallucination, but several reduce it:

  1. RAG. Grounding responses in retrieved documents reduces hallucination by providing factual context. But the model can still hallucinate details not in the context, or misinterpret the context.
  2. Fine-tuning on verified data. Training on curated, fact-checked data reduces but does not eliminate hallucination.
  3. Self-consistency. Generate multiple responses and check for agreement. Inconsistent answers are more likely to be hallucinations.
  4. Citation and attribution. Require the model to cite sources, then verify the citations. This shifts hallucination from "silent" to "auditable."
  5. Confidence calibration. Use the model's token-level probabilities to flag low-confidence generations. However, LLMs are often overconfident on incorrect answers.
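Self-consistency (item 3) reduces to a majority vote over sampled answers; a minimal sketch with illustrative strings:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over sampled answers; the agreement rate flags shaky generations."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

# Five answers sampled (temperature > 0) for the same factual question (illustrative):
samples = ["Paris", "Paris", "Lyon", "Paris", "Paris"]
answer, agreement = self_consistency(samples)
print(answer, agreement)  # Paris 0.8 -- high agreement, less likely a hallucination
```

A production variant would normalize answers before voting (lowercasing, stripping punctuation, extracting the final numeric span for math problems) so that superficially different phrasings of the same answer count together.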

Fundamentals > Frontier: Hallucination is a special case of a general principle: generative models produce samples from the learned distribution, and the learned distribution does not perfectly match reality. This is true of VAEs (Chapter 12), GANs, and diffusion models. The difference is that LLM hallucinations are expressed in natural language, which makes them uniquely dangerous — humans are poorly calibrated at detecting fluent falsehoods.


11.9 Evaluating LLM Outputs

Evaluating LLM outputs is one of the hardest unsolved problems in NLP. Unlike classification (where accuracy is well-defined) or regression (where MSE is clear), evaluating the quality of free-form text requires judgment.

Automated Metrics

BLEU (Bilingual Evaluation Understudy). Measures n-gram precision between the generated text and reference text(s):

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $p_n$ is the modified $n$-gram precision and BP is a brevity penalty for short outputs. BLEU was designed for machine translation and correlates poorly with human judgments for open-ended generation.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation). Measures $n$-gram recall between generated and reference text. ROUGE-L uses longest common subsequence. Better suited for summarization than BLEU, but still limited by the need for reference texts.

BERTScore (Zhang et al., 2020). Computes the similarity between generated and reference text using contextual embeddings from BERT. Each token in the candidate is matched to the most similar token in the reference (and vice versa), producing precision, recall, and F1 scores in embedding space. BERTScore correlates better with human judgments than BLEU or ROUGE because it captures semantic similarity rather than exact n-gram matches.
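The BLEU formula above can be implemented in a few lines. This sketch handles a single reference and omits the smoothing that production implementations (e.g., sacreBLEU) apply to zero-count n-grams:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with one reference (no smoothing, for clarity)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # "Modified" precision: clip each n-gram's count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0            # any zero n-gram precision zeroes unsmoothed BLEU
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(sum(log_precisions) / max_n)

print(f"{bleu('the cat sat on the mat', 'the cat sat on the mat'):.3f}")  # 1.000
print(f"{bleu('the cat is on the mat', 'the cat sat on the mat'):.3f}")   # 0.000
```

The second example shows BLEU's brittleness: a one-word substitution kills every 4-gram match, and without smoothing the whole score collapses to zero — one reason it correlates poorly with human judgment on free-form text.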

LLM-as-Judge

A more recent and increasingly popular approach: use a strong LLM (e.g., GPT-4) to evaluate the outputs of another LLM. The judge receives a prompt, the generated response, and optionally a reference answer, and provides a structured evaluation.

def llm_judge_prompt(question: str, response: str, reference: str = "") -> str:
    """Construct a prompt for LLM-as-judge evaluation.

    This prompt template evaluates response quality on four dimensions:
    relevance, accuracy, completeness, and clarity.

    Args:
        question: The original question.
        response: The model's response to evaluate.
        reference: Optional reference answer for comparison.

    Returns:
        Formatted judge prompt.
    """
    ref_section = ""
    if reference:
        ref_section = f"""
Reference Answer:
{reference}
"""

    return f"""You are an expert evaluator. Rate the following response on a scale of 1-5 for each criterion.

Question:
{question}
{ref_section}
Response to Evaluate:
{response}

Evaluate on these criteria:
1. **Relevance** (1-5): Does the response address the question?
2. **Accuracy** (1-5): Is the information factually correct?
3. **Completeness** (1-5): Does it cover the key aspects?
4. **Clarity** (1-5): Is it well-organized and easy to understand?

Provide your ratings in the format:
Relevance: X/5
Accuracy: X/5
Completeness: X/5
Clarity: X/5
Overall: X/5
Reasoning: [brief explanation]"""

LLM-as-judge has known biases: models tend to prefer their own outputs (self-preference bias), longer responses (verbosity bias), and responses that appear first in a comparison (position bias). Mitigations include swapping position order, using different judge models, and calibrating with human annotations.
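Position bias in pairwise comparisons can be partially neutralized by running each comparison twice with the order swapped and keeping only consistent verdicts — a sketch with a hypothetical judge callable:

```python
def debiased_compare(judge, prompt, resp_a, resp_b):
    """Query the judge twice with the order swapped; keep the verdict only if consistent."""
    first = judge(prompt, resp_a, resp_b)        # verdict: "A", "B", or "tie"
    second = judge(prompt, resp_b, resp_a)       # same pair, positions swapped
    flipped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == flipped else "tie"  # disagreement counts as a tie

# A judge with pure position bias always prefers whichever response it sees first:
position_biased_judge = lambda prompt, a, b: "A"
print(debiased_compare(position_biased_judge, "Q?", "resp1", "resp2"))  # tie
```

A purely position-biased judge produces only ties under this scheme, so any remaining non-tie verdicts reflect a preference that survived the swap.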

A Practical Evaluation Framework

For production LLM systems, use a layered evaluation approach:

| Layer | Method | What It Catches |
|---|---|---|
| Unit tests | Deterministic assertions on known inputs | Regressions, format violations |
| Automated metrics | BLEU, ROUGE, BERTScore against references | Quality drift |
| LLM-as-judge | Structured scoring by a strong model | Nuanced quality issues |
| Human evaluation | Expert annotation on a sample | Ground truth quality |

The key insight: no single evaluation method is sufficient. Use all four layers, with human evaluation as the gold standard for calibrating the automated methods.


11.10 LLMs in the StreamRec Content Platform

StreamRec's content catalog contains 200,000 items (articles, videos, podcasts) with titles, descriptions, categories, and metadata. Users want to search using natural language: "show me documentaries about climate change from the last year" or "find comedies similar to Silicon Valley."

The current keyword-based search fails on semantic queries — searching for "climate change" misses documentaries about "global warming" or "rising sea levels." A RAG-based system can bridge this gap.

Architecture:

User Query
    │
    ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Embedding   │────▶│  Vector DB   │────▶│  Retrieval   │
│  Model       │     │  (FAISS)     │     │  (Top-K)     │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐
                                          │  Reranker    │
                                          │  (Cross-Enc) │
                                          └──────┬───────┘
                                                 │
                                                 ▼
                                          ┌──────────────┐     ┌──────────────┐
                                          │  LLM         │────▶│  Response    │
                                          │  (Summary)   │     │  to User     │
                                          └──────────────┘     └──────────────┘

The progressive project milestone for this chapter: implement the embedding, indexing, retrieval, and prompt construction stages using the code from Section 11.6 as a starting point. The LLM generation step can use any API-accessible model. Evaluate retrieval quality using precision@$k$ and NDCG against manually labeled relevance judgments.
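Both retrieval metrics are short functions; the sketch below uses hypothetical graded relevance judgments, not real labels:

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved doc ids that are in the relevant set."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def ndcg_at_k(retrieved, relevance, k):
    """DCG of the system ranking divided by DCG of the ideal ranking (graded labels)."""
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical graded judgments for one query (2 = highly relevant, 1 = partial):
relevance = {"doc_004": 2, "doc_002": 2, "doc_001": 1}
retrieved = ["doc_004", "doc_002", "doc_001"]           # system's ranking
print(precision_at_k(retrieved, set(relevance), k=3))   # 1.0
print(ndcg_at_k(retrieved, relevance, k=3))             # 1.0 -- matches the ideal order
```

Precision@$k$ treats relevance as binary and ignores ordering within the top $k$; NDCG rewards placing the most relevant items first, which matters when the LLM reads retrieved chunks in order.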

Pharma Application: Clinical Note Extraction

MediCore Pharmaceuticals (introduced in Chapter 3) is exploring LLMs for extracting structured data from unstructured clinical notes. A typical clinical note reads:

"Patient presents with persistent cough x 3 weeks, low-grade fever (100.2F), and fatigue. CXR shows bilateral infiltrates. Started on azithromycin 500mg day 1, then 250mg days 2-5. Follow-up in 1 week."

The goal is to extract structured fields: symptoms, vital signs, imaging findings, medications (with dosages), and follow-up plans. Traditional NLP approaches (rule-based NER, custom BERT models) require extensive labeled data and domain expertise. An LLM with few-shot prompting can extract this information with minimal setup — but with significant risks:

  1. Hallucinated details. The LLM might infer a diagnosis not stated in the note.
  2. Missed information. Subtle clinical details (negative findings, implicit dosing changes) may be overlooked.
  3. Inconsistent extraction. The same note, processed twice, might yield different structured outputs.

These risks make LLM-based extraction promising for research acceleration but dangerous for clinical decision-making without human verification. The evaluation framework from Section 11.9 is essential: compare LLM extractions against expert annotations on a held-out set, measure precision and recall per field, and flag cases where the LLM's confidence is low.
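Per-field precision and recall against expert annotations can be sketched as set comparisons (the field names and values below are illustrative, drawn from the sample note above):

```python
def field_prf(extracted, gold):
    """Per-field precision/recall, treating each field's value as a set of items."""
    metrics = {}
    for field in gold:
        pred = set(extracted.get(field, ()))
        true = set(gold[field])
        tp = len(pred & true)                     # items both extracted and annotated
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(true) if true else 1.0
        metrics[field] = (precision, recall)
    return metrics

gold = {"symptoms": {"cough", "fever", "fatigue"},
        "medications": {"azithromycin 500mg", "azithromycin 250mg"}}
extracted = {"symptoms": {"cough", "fever"},                       # missed "fatigue"
             "medications": {"azithromycin 500mg", "ibuprofen"}}   # hallucinated drug
for field, (p, r) in field_prf(extracted, gold).items():
    print(f"{field}: precision={p:.2f}, recall={r:.2f}")
```

Low recall on a field flags missed information; low precision flags hallucinated details — exactly the two failure modes listed above, made measurable.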


11.11 Chapter Summary

Large language models are decoder-only transformers scaled to billions of parameters and trained on trillions of tokens. The architecture is the same transformer from Chapter 10 — causal self-attention, feed-forward networks, residual connections — but scale creates emergent capabilities that no architectural innovation produces at small scale.

The three-stage training pipeline — pretraining on raw text, supervised fine-tuning on instruction data, and alignment via RLHF or DPO — transforms a text completion engine into a useful assistant. Scaling laws (Chinchilla) provide principled guidance on how to allocate compute between model size and data size.

LoRA and QLoRA make fine-tuning accessible: by training only low-rank perturbations to frozen pretrained weights, practitioners can adapt a 7B model on a single GPU. RAG extends the model's knowledge beyond its training data by retrieving relevant documents at inference time.

The honest assessment: LLMs are powerful but unreliable. They hallucinate. Their outputs are sensitive to prompt wording. Their evaluation is an unsolved problem. These limitations are not temporary — they are consequences of the training objective itself. Production systems must be designed with these limitations in mind: RAG for grounding, evaluation pipelines for quality, human oversight for high-stakes decisions.

The next chapter introduces generative models — VAEs, GANs, and diffusion models — which share the same fundamental challenge: learning to generate samples from a complex distribution. The hallucination problem in LLMs is a special case of this broader challenge.


Key Terms Introduced

| Term | Definition |
|---|---|
| Language model | A model that assigns probabilities to sequences of tokens |
| Autoregressive generation | Generating tokens one at a time, each conditioned on all previous tokens |
| Next-token prediction | The training objective: predict the next token given the preceding context |
| Perplexity | $\exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right)$ — the exponentiated average negative log-likelihood; lower is better |
| Decoder-only transformer | Transformer architecture using only causal self-attention (no encoder) |
| GPT architecture | Decoder-only transformer; modern variants use pre-norm, SwiGLU FFN, and RoPE |
| Pretraining | Training on massive unlabeled text with next-token prediction |
| Supervised fine-tuning (SFT) | Training on curated (prompt, response) pairs to follow instructions |
| RLHF | Reinforcement learning from human feedback — using a reward model trained on human preferences |
| DPO | Direct preference optimization — aligning the model directly on preference data without a reward model |
| PEFT | Parameter-efficient fine-tuning — adapting a model by training a small fraction of parameters |
| LoRA | Low-Rank Adaptation — training low-rank matrices $BA$ as additive weight updates |
| QLoRA | LoRA applied to 4-bit quantized models, enabling fine-tuning on consumer hardware |
| Quantization | Reducing weight precision (fp16 $\to$ INT8/INT4) to decrease memory and increase speed |
| GPTQ | Post-training quantization using second-order weight optimization |
| AWQ | Activation-aware weight quantization — protecting salient weights during quantization |
| RAG | Retrieval-augmented generation — retrieving documents to augment the LLM's context |
| Vector database | A database optimized for storing and searching dense embeddings |
| Chunking strategy | How documents are split into segments for embedding and retrieval |
| Prompt engineering | Designing input prompts to elicit desired model behavior |
| Few-shot learning | Providing examples in the prompt to guide model behavior without training |
| Chain-of-thought | Prompting the model to reason step-by-step before answering |
| Tokenizer (BPE) | Byte-pair encoding — building a vocabulary by iteratively merging frequent token pairs |
| SentencePiece | Language-agnostic tokenizer that operates on raw text without language-specific pre-tokenization |
| Context window | The maximum number of tokens the model can process in a single forward pass |
| Hallucination | Generating fluent, confident text that is factually incorrect |
| BLEU | N-gram precision metric for evaluating generated text against references |
| ROUGE | N-gram recall metric, commonly used for summarization evaluation |
| BERTScore | Embedding-based similarity metric using contextual representations |
| LLM-as-judge | Using a strong LLM to evaluate the outputs of another model |