Chapter 11: Further Reading

Essential Sources

1. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, et al., "Training Compute-Optimal Large Language Models" (NeurIPS, 2022) — the Chinchilla paper

This paper fundamentally changed how the field thinks about scaling LLMs. By training over 400 models ranging from 70M to 16B parameters on varying amounts of data, Hoffmann et al. derived scaling laws showing that model size and training data should be scaled in equal proportion for a given compute budget. The key result: the optimal ratio is approximately 20 tokens per parameter. This meant that many existing models — including Gopher (280B parameters, 300B tokens) — were severely undertrained. Chinchilla (70B parameters, 1.4T tokens) matched Gopher's performance with 4x fewer parameters, demonstrating that smaller, well-trained models are both better and cheaper to deploy.

Reading guidance: Start with Section 1 (introduction) for the core insight, then Section 3 for the three complementary approaches to estimating the optimal allocation (fixing model sizes and varying the number of training tokens, IsoFLOP profiles, and fitting a parametric loss function). Table 3 is the critical result: the predicted optimal model size and data size for various compute budgets. Section 4 presents the Chinchilla training and evaluation, confirming the predictions. The appendix contains the raw results of all 400+ training runs — invaluable for anyone designing their own scaling experiments. Note that subsequent work (LLaMA 3, Gemma) has moved beyond Chinchilla optimality by deliberately "over-training" smaller models to reduce inference cost at the expense of training efficiency. The scaling laws remain valid; the objective function has shifted from "minimize loss for fixed compute" to "minimize total cost of ownership."
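The 20-tokens-per-parameter rule can be turned into a back-of-the-envelope allocator with the common approximation $C \approx 6ND$ (training FLOPs as a function of parameters $N$ and tokens $D$). A minimal sketch, using these two approximations rather than the paper's fitted law:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into model size N and token count D.

    Uses the approximation C = 6*N*D together with the Chinchilla
    rule of thumb D = 20*N. Both constants are rules of thumb from
    the paper's analysis, not its exact fitted coefficients.
    """
    # C = 6*N*D and D = k*N  =>  C = 6*k*N^2  =>  N = sqrt(C / (6*k))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: 70B parameters on 1.4T tokens is exactly 20 tokens/param
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
```

Plugging in Chinchilla's own budget recovers 70B parameters and 1.4T tokens, which is a quick sanity check on the two approximations.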

2. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, "LoRA: Low-Rank Adaptation of Large Language Models" (ICLR, 2022)

The paper that made LLM fine-tuning practical for most practitioners. Hu et al. observed that the weight updates during fine-tuning have low intrinsic rank and proposed training only low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ that are added to frozen pretrained weights. With $r = 4$-$8$, LoRA matches full fine-tuning performance while updating up to 10,000x fewer parameters. The method is elegant in its simplicity: the LoRA weights can be merged into the base model at inference time ($W_{\text{merged}} = W_0 + \frac{\alpha}{r}BA$), incurring zero additional latency. Multiple LoRA adapters can be trained for different tasks and swapped at serving time, enabling multi-task deployment from a single base model.

Reading guidance: Section 2 (problem statement) and Section 4 (the method) are the core contributions. The initialization strategy is important: $A$ is Gaussian random, $B = 0$, so $\Delta W = 0$ at the start of training and the model begins from the pretrained weights. Section 5 presents ablations on which layers to adapt (attention projections are most effective) and the effect of rank $r$ (diminishing returns above $r = 8$ for most tasks). Section 7 analyzes the learned $\Delta W$ and confirms that it is indeed low-rank, providing empirical support for the method's premise. For the quantized variant, read Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (NeurIPS, 2023), which combines LoRA with 4-bit NormalFloat quantization and paged optimizers to fine-tune 65B models on a single 48GB GPU.
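The forward pass, initialization, and merge step can be verified in a few lines of NumPy. A sketch with illustrative dimensions (the layer sizes and $\alpha$ here are arbitrary, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16                   # hidden size, rank, scaling (illustrative)

W0 = rng.normal(size=(d, d))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d))   # trainable, Gaussian init
B = np.zeros((d, r))                      # trainable, zero init => delta_W = 0 at start

def lora_forward(x):
    # Training-time path: frozen weight plus scaled low-rank update
    return x @ W0.T + (alpha / r) * (x @ A.T @ B.T)

# At inference the adapter merges into the base weight: zero extra latency
W_merged = W0 + (alpha / r) * (B @ A)

x = rng.normal(size=(4, d))
assert np.allclose(lora_forward(x), x @ W_merged.T)
```

The final assertion is the "zero additional latency" claim in miniature: the two-matmul adapter path and the single merged matmul compute the same function.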

3. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS, 2020)

The foundational paper for RAG. Lewis et al. proposed combining a pretrained retrieval model (DPR — Dense Passage Retrieval) with a pretrained sequence-to-sequence generator (BART), training the two components end-to-end. The retriever uses dense embeddings to find relevant passages from a large corpus (Wikipedia), and the generator conditions on these passages to produce the output. Two variants are presented: RAG-Sequence (same passages used for the entire output) and RAG-Token (different passages can be selected for each output token).

Reading guidance: Section 3 describes the RAG architecture: the retriever $p_\eta(z \mid x)$ produces a distribution over documents, and the generator $p_\theta(y_i \mid x, z, y_{1:i-1})$ produces output tokens conditioned on the input and retrieved documents. The marginalization over documents (Equation 2 for RAG-Sequence, Equation 4 for RAG-Token) is the key technical contribution. Section 4.2 shows that RAG outperforms pure parametric models on knowledge-intensive tasks (open-domain QA, fact verification) and generates more factual text. Note that modern RAG implementations differ significantly from this paper: most use frozen retrievers and frozen generators connected by prompt engineering rather than end-to-end training. The core insight — grounding generation in retrieved evidence — remains the foundation.
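The RAG-Sequence marginalization, $p(y \mid x) = \sum_z p_\eta(z \mid x)\, p_\theta(y \mid x, z)$, reduces to a log-sum-exp over retrieved documents once the retriever and generator scores are in hand. A minimal sketch operating on hypothetical precomputed log-probabilities (no actual retriever or generator):

```python
import math

def rag_sequence_logprob(doc_logprobs, seq_logprobs_given_doc):
    """RAG-Sequence marginalization over top-k retrieved documents.

    doc_logprobs: log p(z|x) for each retrieved document z.
    seq_logprobs_given_doc: log p(y|x,z), already summed over output
    tokens, for the same documents. Both are assumed precomputed.
    """
    terms = [lz + ly for lz, ly in zip(doc_logprobs, seq_logprobs_given_doc)]
    # log-sum-exp over documents, for numerical stability
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

For example, two equally likely documents under which the output has probability 0.2 and 0.4 give a marginal probability of 0.3. RAG-Token applies the same marginalization per output token instead of once per sequence.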

4. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn, "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" (NeurIPS, 2023)

This paper simplifies the RLHF pipeline by eliminating the explicit reward model and the RL optimization loop. The key theoretical insight is that the optimal policy under the KL-constrained RLHF objective can be expressed in closed form as a function of the reward, and substituting this into the Bradley-Terry preference model yields a supervised loss that depends only on the policy model. The resulting DPO loss requires only paired preference data (preferred vs. dispreferred responses) and standard supervised training — no reward model training, no PPO, no RL infrastructure.

Reading guidance: Section 3 is the mathematical core: Equation 4 derives the optimal policy, and Equation 7 derives the DPO loss. The derivation is clean and follows from standard optimization under KL constraints — readers with background from Chapter 3 (probability) and Chapter 4 (information theory, KL divergence) will find it accessible. Section 5 presents experiments showing DPO matches or exceeds RLHF performance across summarization and dialogue tasks. The practical impact is significant: DPO reduces the RLHF pipeline from three interacting components (reward model, value function, policy with PPO) to a single supervised training loop. For the broader context of alignment methods, see Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (NeurIPS, 2022) — the InstructGPT paper that established the RLHF pipeline.
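The DPO loss itself is short enough to write out: $-\log \sigma\big(\beta[(\log \pi_\theta(y_w \mid x) - \log \pi_{\text{ref}}(y_w \mid x)) - (\log \pi_\theta(y_l \mid x) - \log \pi_{\text{ref}}(y_l \mid x))]\big)$. A per-example sketch, assuming sequence-level log-probabilities have already been computed for the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (margin_w - margin_l)),
    where each margin is log pi_theta(y|x) - log pi_ref(y|x) for the
    preferred (w) and dispreferred (l) response. beta=0.1 is illustrative.
    """
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$; the loss falls as the policy shifts probability toward the preferred response relative to the reference, which is the entire training signal — no reward model, no PPO rollouts.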

5. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (NeurIPS, 2022)

This paper demonstrated that prompting LLMs with few-shot exemplars containing intermediate reasoning steps dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks. The improvement is particularly large for tasks that require multi-step reasoning: on the GSM8K math benchmark, chain-of-thought prompting improved PaLM 540B's accuracy from 17.9% to 56.9%. The mechanism is straightforward: intermediate tokens allow the model to decompose complex problems and use its context window as a working memory for sequential computation.

Reading guidance: Section 2 presents the core idea with examples. The key experiments are in Section 3: chain-of-thought helps only for sufficiently large models (roughly 100B+ parameters), suggesting that smaller models lack the capacity for coherent multi-step reasoning. Table 1 and Figure 4 show the scaling behavior: the benefit of chain-of-thought increases with model size. Section 5 discusses failure modes, including cases where the chain of reasoning is correct but the final answer is wrong, and cases where the reasoning is plausible but logically invalid. For the zero-shot variant, see Kojima et al., "Large Language Models Are Zero-Shot Reasoners" (NeurIPS, 2022), which showed that simply appending "Let's think step by step" (without few-shot examples) also improves reasoning.
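The prompt format is the entire method, so it is worth seeing assembled. A sketch using one of the paper's GSM8K exemplars (the real prompts use eight hand-written exemplars; the assembly function here is illustrative):

```python
# One exemplar from the paper's GSM8K prompt: a question paired with a
# worked rationale ending in "The answer is N." The full prompt uses eight.
EXEMPLARS = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
     "tennis balls. 5 + 6 = 11. The answer is 11."),
]

def cot_prompt(question, exemplars=EXEMPLARS):
    """Assemble a few-shot chain-of-thought prompt: worked exemplars
    followed by the new question, leaving the model to continue after 'A:'."""
    parts = [f"Q: {q}\nA: {rationale}" for q, rationale in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

The model completes the final "A:" with a rationale in the style of the exemplars, and the numeric answer is parsed from the trailing "The answer is N." pattern.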