> "The bitter lesson is that general methods that leverage computation are ultimately the most effective."
Learning Objectives
- State and interpret the neural scaling laws discovered by Kaplan et al. and the Chinchilla revision by Hoffmann et al.
- Compute the optimal allocation of a fixed compute budget between model size and training tokens
- Define emergent abilities and discuss the debate around whether they are genuine phase transitions or measurement artifacts
- Compare the architectures, training strategies, and capabilities of major LLM families including GPT-4, Claude, Llama, Mistral, and Gemini
- Select and interpret standard LLM benchmarks such as MMLU, HumanEval, GSM8K, and HELM
- Explain the instruction-following paradigm and how it differs from raw pre-training
- Analyze how tokenizer design affects model performance, multilinguality, and effective context length
- Trace the evolution of context windows from 512 tokens to millions of tokens and understand the enabling techniques
- Describe the mixture-of-experts (MoE) architecture and explain why it decouples parameter count from compute cost
In This Chapter
- 22.1 Introduction: The Era of Scale
- 22.2 Neural Scaling Laws
- 22.3 Chinchilla and Compute-Optimal Training
- 22.4 Emergent Abilities
- 22.5 Major LLM Families
- 22.6 Benchmarking Large Language Models
- 22.7 The Instruction-Following Paradigm
- 22.8 Tokenizer Effects
- 22.9 Context Window Evolution
- 22.10 Mixture of Experts (MoE)
- 22.11 Putting It All Together: The Modern LLM Pipeline
- 22.12 The Cost of Scale
- 22.13 Summary
- References
Chapter 22: Scaling Laws and Large Language Models
Part IV: Attention, Transformers, and Language Models
"The bitter lesson is that general methods that leverage computation are ultimately the most effective." --- Rich Sutton, The Bitter Lesson (2019)
22.1 Introduction: The Era of Scale
Chapter 21 introduced decoder-only Transformer models and the autoregressive next-token prediction objective. We saw how relatively simple architectural choices---causal masking, learned positional embeddings, pre-norm layer normalization---produce models capable of generating coherent text. But something remarkable happens when we take those same architectural choices and scale them up by orders of magnitude: the models do not merely produce better text, they acquire qualitatively new capabilities.
Between 2018 and 2024, the largest language models grew from roughly 100 million parameters (GPT-1) to trillions of parameters (GPT-4, Gemini Ultra). Training compute budgets expanded from petaflop-days to hundreds of thousands of petaflop-days. This scaling was not accidental---it was driven by a series of empirical discoveries showing that model performance follows remarkably predictable power-law relationships with model size, dataset size, and compute budget.
Understanding these scaling laws is essential for modern AI engineering. They tell us how to allocate resources efficiently, predict future capabilities, and reason about when scaling is likely to help versus when architectural innovation is needed. In this chapter, we will study the original scaling laws discovered by Kaplan et al. (2020), the crucial revision introduced by Hoffmann et al. (2022) with the Chinchilla model, and the ongoing debate about emergent abilities.
We will then survey the major LLM families that have emerged from this scaling paradigm---GPT-4, Claude, Llama, Mistral, and Gemini---examining how each balances model size, training data, architectural innovations, and post-training refinements. We will learn how to benchmark these models systematically, understand the instruction-following paradigm that makes them usable, and explore two critical technical dimensions: tokenizer design and context window evolution. The chapter concludes with a deep treatment of Mixture of Experts (MoE), a technique that decouples parameter count from per-token compute cost and has become central to modern LLM design.
22.2 Neural Scaling Laws
22.2.1 The Empirical Discovery
In January 2020, a team at OpenAI led by Jared Kaplan published "Scaling Laws for Neural Language Models," one of the most influential papers in modern AI. The central finding was striking in its simplicity: language model performance, measured by cross-entropy loss $L$ on held-out text, follows smooth power-law relationships with three key variables.
Model size. When training on sufficiently large data, the loss decreases as a power law in the number of non-embedding parameters $N$:
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$
where $\alpha_N \approx 0.076$ and $N_c$ is a constant. This means that each 10x increase in parameters multiplies the loss by a roughly constant factor (about $10^{-\alpha_N} \approx 0.84$).
Dataset size. When the model is sufficiently large, the loss decreases as a power law in the number of training tokens $D$:
$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$
where $\alpha_D \approx 0.095$. Note that $\alpha_D > \alpha_N$: because the data term of the joint loss falls faster per order of magnitude, a compute-optimal allocation needs to grow the dataset more slowly than the model, which Kaplan et al. interpreted as evidence that scaling model size matters more than scaling data.
Compute budget. When both model size and data are scaled optimally for a given compute budget $C$ (measured in FLOPs), the loss follows:
$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$
where $\alpha_C \approx 0.050$.
22.2.2 The Power-Law Nature
Power laws appear throughout physics and complex systems, but their appearance in deep learning was initially surprising. A power law has the general form $y = ax^{-b}$, which appears as a straight line on a log-log plot. The Kaplan scaling laws hold over many orders of magnitude---from millions to billions of parameters, from millions to hundreds of billions of tokens, and from tiny fractions of a petaflop-day to thousands of petaflop-days of compute.
Several properties of power laws are worth noting:
- Diminishing returns: Each additional order of magnitude yields the same absolute improvement in log-loss, meaning the marginal benefit of scaling decreases in absolute terms.
- Predictability: Because the relationship is smooth and monotonic, one can fit a scaling curve on small-scale experiments and extrapolate to larger scales (made concrete in the sketch after this list). This was used extensively for planning GPT-4's training.
- Universality: Similar power-law exponents appear across different model architectures (Transformers, LSTMs) and different data domains (text, images, code, math), suggesting that the phenomenon is fundamental rather than architecture-specific.
- Irreducible loss: Extrapolating the scaling curves suggests an irreducible loss $L_\infty$ that cannot be overcome by scaling alone, representing the inherent entropy of natural language.
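To make the predictability property concrete, the following sketch fits a single-variable power law $L(N) = (N_c/N)^{\alpha_N}$ to a handful of hypothetical small-scale results by linear regression in log-log space and extrapolates to a larger model. The parameter counts and loss values are invented for illustration; real fits typically also include an irreducible-loss term.

```python
import numpy as np

# Hypothetical small-scale measurements (illustrative only, not real results):
# non-embedding parameter counts and held-out cross-entropy losses.
N = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
L = np.array([4.10, 3.85, 3.62, 3.41, 3.22])

# A power law L = (N_c / N)^alpha is a straight line in log-log space:
# log L = alpha * log N_c - alpha * log N.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_N = -slope
N_c = np.exp(intercept / alpha_N)

# Extrapolate to a model 10x larger than anything in the fit.
predicted = (N_c / 1e10) ** alpha_N
print(f"fitted alpha_N ~ {alpha_N:.3f}, predicted loss at 10B params ~ {predicted:.2f}")
```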
The combined scaling law that accounts for both model size and data simultaneously is:
$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$
This bivariate form reveals a critical insight: there is an optimal ratio between $N$ and $D$ for any fixed compute budget.
22.2.3 Kaplan's Compute-Optimal Allocation
Kaplan et al. derived how to split a compute budget $C$ between model size $N$ and training tokens $D$. Under the approximation $C \approx 6ND$ (each token requires roughly 6 floating-point operations per parameter in a forward-backward pass), they found that the loss-optimal allocation scales as:
$$N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}$$
This prescription says that when you get more compute, you should invest most of it into making the model larger, with only modest increases in training data. Under this regime, a model with 10x more compute should be roughly 5x larger but trained on only 2x more data.
This recommendation had an enormous practical impact. It influenced the training of GPT-3, which used 175 billion parameters trained on 300 billion tokens---a ratio of roughly 1.7 tokens per parameter. By Kaplan's logic, this was near-optimal.
But the prescription was wrong.
22.2.4 Why Were the Kaplan Laws Wrong?
The error in Kaplan's compute-optimal allocation stemmed from their experimental methodology. They trained models of different sizes for a fixed number of training steps (not tokens). Because larger models process more FLOPs per step but the same number of tokens per step, this approach systematically undertrained the smaller models relative to the larger ones. The fitted scaling law thus overestimated the benefit of model size and underestimated the benefit of more data.
Additionally, Kaplan used a learning rate schedule that was not independently optimized for each model size. Larger models tend to benefit from lower learning rates and longer warmup periods. Using a shared schedule may have made the larger models appear relatively better than they would with individually tuned hyperparameters.
These methodological subtleties illustrate an important lesson: scaling laws are empirical observations that depend on the experimental setup. Changing how you train (learning rate, schedule, batch size, data ordering) can shift the optimal allocation. This is why multiple groups have found somewhat different scaling exponents.
22.3 Chinchilla and Compute-Optimal Training
22.3.1 The Chinchilla Revision
In March 2022, a team at DeepMind led by Jordan Hoffmann published "Training Compute-Optimal Large Language Models," introducing a model called Chinchilla that challenged Kaplan's compute allocation.
Hoffmann et al. conducted a more careful analysis using three complementary approaches:
- Fixed-model experiments: Training models of different sizes for different numbers of tokens, fitting the resulting loss surface.
- IsoFLOP analysis: For each fixed compute budget, finding the optimal model size empirically.
- Parametric loss fitting: Fitting a refined version of the bivariate scaling law.
All three approaches converged on the same conclusion: the optimal allocation gives roughly equal scaling to model size and training tokens:
$$N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}$$
This means that for every doubling of compute, you should double both the model size and the training data. The optimal ratio is approximately 20 tokens per parameter---an order of magnitude more data-intensive than Kaplan's recommendation.
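As a rough worked example, the sketch below splits a training FLOP budget between parameters and tokens, assuming the $C \approx 6ND$ approximation and the ~20 tokens-per-parameter rule of thumb; both are approximations, so treat the outputs as ballpark figures.

```python
def chinchilla_split(C, tokens_per_param=20.0):
    """Split a training FLOP budget C between parameters N and tokens D,
    assuming C ~= 6*N*D and a fixed ratio r = D/N (~20 for Chinchilla-optimal)."""
    N = (C / (6.0 * tokens_per_param)) ** 0.5   # from 6 * r * N^2 = C
    D = tokens_per_param * N
    return N, D

# Example: a budget of 6e23 FLOPs (roughly the Chinchilla/Gopher training scale)
N, D = chinchilla_split(6e23)
print(f"N ~ {N/1e9:.0f}B parameters, D ~ {D/1e12:.2f}T tokens")
```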
22.3.2 The Chinchilla Model
To validate their findings, Hoffmann et al. trained Chinchilla: a 70-billion-parameter model trained on 1.4 trillion tokens. For comparison, the Gopher model (also from DeepMind) had 280 billion parameters but was trained on only 300 billion tokens---roughly the same compute budget.
The results were dramatic. Chinchilla outperformed Gopher on virtually every benchmark despite being 4x smaller. It also outperformed GPT-3 (175B parameters, 300B tokens) and various other models that were significantly larger. The implications were clear: the field had been dramatically undertraining its models.
22.3.3 The Refined Scaling Law
The Chinchilla team proposed a refined loss function:
$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$
where $E$ represents the irreducible entropy of natural language, $A/N^\alpha$ captures the error due to finite model capacity, and $B/D^\beta$ captures the error due to finite training data. They estimated $\alpha \approx 0.34$, $\beta \approx 0.28$, and $E \approx 1.69$ nats.
This additive decomposition is more interpretable than Kaplan's multiplicative form. It makes explicit that loss has three components: an irreducible floor, a model-size-dependent term, and a data-dependent term. The optimal compute allocation minimizes the sum of the last two terms subject to the constraint $C \approx 6ND$.
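The sketch below makes this concrete: it grid-searches the model size that minimizes the parametric loss for a fixed FLOP budget under $C \approx 6ND$. The constants $A$ and $B$ are set to approximately the values fitted by Hoffmann et al.; treat them, and the resulting numbers, as illustrative rather than definitive.

```python
import numpy as np

# Parametric Chinchilla fit; A and B are approximately the published fitted values.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def compute_optimal(C):
    """Minimize L(N, D) = E + A/N^alpha + B/D^beta subject to D = C / (6N)."""
    N = np.logspace(8, 13, 4000)          # candidate model sizes: 100M to 10T params
    D = C / (6.0 * N)
    L = E + A / N**alpha + B / D**beta
    i = np.argmin(L)
    return N[i], D[i], L[i]

N_opt, D_opt, L_opt = compute_optimal(6e23)   # a Chinchilla-scale budget
print(f"N* ~ {N_opt:.2e}, D* ~ {D_opt:.2e}, tokens/param ~ {D_opt / N_opt:.0f}")
```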
22.3.4 Practical Implications
The Chinchilla result had immediate and far-reaching consequences:
Retraining priorities. Labs realized that existing large models were significantly undertrained. This motivated training campaigns like LLaMA (65B parameters, 1.4T tokens) and LLaMA 2 (70B parameters, 2T tokens), both of which followed Chinchilla-optimal ratios.
Smaller, better models. Instead of always building the largest possible model, teams began building smaller models trained on more data. A well-trained 7B model can often match an undertrained 65B model on many tasks.
Data becomes the bottleneck. When models need 20x more tokens than parameters, a 100B-parameter model requires 2 trillion training tokens. High-quality text data on the internet is finite, and the community began confronting data scarcity---leading to research on synthetic data generation, data filtering, and repeated-epoch training.
Beyond Chinchilla-optimal. In practice, many production LLMs are trained well beyond the Chinchilla-optimal point. The reason is inference cost: a smaller model trained on more data costs less to serve per query, even if the training compute is higher than "optimal." LLaMA 3 (8B parameters trained on 15T tokens) exemplifies this inference-optimal strategy. The optimal training point depends on the expected lifetime query volume.
22.3.5 Post-Chinchilla Scaling Research
Subsequent research has refined and extended the Chinchilla findings:
- Data quality scaling laws (Muennighoff et al., 2023): The scaling exponent depends on data quality. Filtered, deduplicated data yields steeper scaling curves.
- Repeated data (Hernandez et al., 2022): Training on the same data multiple times follows a modified scaling law with diminishing returns per epoch.
- Inference-adjusted optimality (Sardana & Frankle, 2023): When total inference compute is factored in, the optimal model is often significantly smaller than Chinchilla-optimal.
- Scaling with RL (Gao et al., 2023): RLHF and other post-training methods have their own scaling laws, partially independent of pre-training scale.
22.4 Emergent Abilities
22.4.1 Defining Emergence
One of the most debated phenomena in LLM research is the concept of emergent abilities---capabilities that appear only in models above a certain scale and are not predictable by extrapolating from smaller models. Wei et al. (2022) defined an emergent ability as one that is "not present in smaller models but is present in larger models," with performance transitioning sharply from near-chance to significantly above chance.
Examples of claimed emergent abilities include:
- Multi-step arithmetic: Models below ~10B parameters fail almost completely at three-digit addition, while models above ~100B succeed with high accuracy.
- Chain-of-thought reasoning: The ability to decompose a complex problem into intermediate steps emerges predominantly in large models (>60B parameters).
- Word unscrambling: Rearranging scrambled words shows a sharp phase transition around 6B--60B parameters.
- International phonetic alphabet transliteration: Near-zero accuracy below a threshold, then a sudden jump.
22.4.2 The Phase Transition Narrative
The initial framing of emergent abilities drew on the physics concept of phase transitions---phenomena where a system's qualitative behavior changes abruptly at a critical point, like water turning to ice at 0 degrees Celsius. Under this narrative, scaling is not merely about "more of the same"; it can produce qualitatively new capabilities at unpredictable thresholds.
This narrative had significant implications:
- It suggested that future, larger models might acquire entirely unexpected capabilities.
- It motivated aggressive scaling even when returns on current benchmarks appeared modest.
- It created both excitement (new capabilities!) and concern (unpredictable risks!) in the AI safety community.
22.4.3 The Critique: Are Emergent Abilities a Mirage?
In 2023, Schaeffer, Miranda, and Koyejo published "Are Emergent Abilities of Large Language Models a Mirage?", arguing that apparent emergence is an artifact of metric choice rather than a genuine phase transition. Their key argument:
- Nonlinear metrics: Metrics like exact-match accuracy impose a sharp threshold (the answer is either perfectly correct or it scores zero). A model whose per-token accuracy improves smoothly with scale will show a sudden jump in exact-match accuracy when the per-token accuracy crosses the threshold needed for the full answer to be correct (made concrete in the sketch at the end of this subsection).
- Linear metrics show smooth improvement: When the same tasks are measured with linear metrics (e.g., token-level edit distance, partial credit), the "emergence" disappears and performance scales smoothly.
- Resolution of the paradox: The authors argue that the underlying capability (e.g., arithmetic accuracy per digit) improves gradually and predictably with scale. The apparent phase transition is a property of the measurement, not the model.
This critique does not mean that scale is unimportant---clearly, larger models are more capable. Rather, it challenges the narrative that scale produces genuinely discontinuous jumps in capability. The debate remains open, with some researchers arguing that certain capabilities (like in-context learning itself) may still represent true emergence.
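A toy calculation, using made-up per-token accuracies, illustrates the mechanism: a smoothly improving per-token accuracy combined with an exact-match metric over a multi-token answer produces what looks like a sharp capability jump.

```python
import numpy as np

# Assumed per-token accuracies at increasing model scales (illustrative values only).
per_token_acc = np.array([0.60, 0.70, 0.80, 0.90, 0.95, 0.99])
answer_length = 10   # number of tokens that must all be correct for an exact match

exact_match = per_token_acc ** answer_length   # nonlinear metric: apparent "emergence"
partial_credit = per_token_acc                 # linear metric: smooth improvement

for p, em in zip(per_token_acc, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact-match {em:.3f}")
```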
22.4.4 Implications for Engineering
Regardless of where one stands in the emergence debate, the practical implications are clear:
- Benchmark design matters: Use linear or continuous metrics whenever possible to get smooth, predictable scaling curves.
- Do not assume small-model failure means large-model failure: A capability absent at 7B may be present at 70B.
- Test at target scale: The only reliable way to know whether a model has a given capability is to test it directly. Extrapolation from smaller models is unreliable for discrete tasks.
22.5 Major LLM Families
22.5.1 The GPT Series (OpenAI)
GPT-3 (2020): 175 billion parameters, trained on 300B tokens. Introduced few-shot in-context learning as a practical paradigm. Demonstrated that scaling alone could produce strong task performance without fine-tuning.
GPT-3.5 / InstructGPT (2022): Applied RLHF to GPT-3-class models, producing ChatGPT. This was the pivotal moment when LLMs transitioned from research tools to consumer products.
GPT-4 (2023): Rumored to be a mixture-of-experts model with approximately 1.8 trillion total parameters (unconfirmed). Multimodal (text + vision). Achieved human-level performance on numerous professional and academic exams. OpenAI published a "GPT-4 Technical Report" notable for its lack of architectural details, marking a shift toward secrecy.
GPT-4o (2024): An "omni" model natively handling text, vision, and audio in a single architecture. Faster and cheaper than GPT-4 while maintaining comparable quality. Represented a shift toward multimodal unification rather than multimodal bolting.
Key innovations: GPT-series models pioneered (1) scaling-driven capability gains, (2) the RLHF alignment pipeline, and (3) the instruction-following paradigm.
22.5.2 Claude (Anthropic)
Claude 1 (2023): A strong conversational model emphasizing helpfulness, harmlessness, and honesty (the "HHH" framework). Trained with both RLHF and Constitutional AI (CAI), where the model critiques and revises its own outputs according to a set of principles.
Claude 2 (2023): Expanded context window to 100K tokens, enabling processing of entire books and codebases. Improved reasoning and reduced hallucination rates.
Claude 3 family (2024): Three model sizes---Haiku (small, fast), Sonnet (balanced), and Opus (largest, most capable). Opus achieved state-of-the-art results on many reasoning benchmarks. The family demonstrated that a range of model sizes is needed to serve different use cases.
Claude 3.5 Sonnet (2024): Surpassed Claude 3 Opus on most benchmarks at lower cost and latency, demonstrating that smaller models with better training can overtake larger predecessors.
Key innovations: Constitutional AI, long-context processing, safety-focused training, and the multi-tier model family strategy.
22.5.3 The Llama Series (Meta)
LLaMA (2023): The first wave of truly competitive open-weight LLMs. Sizes ranged from 7B to 65B parameters, all trained on Chinchilla-optimal data volumes (1--1.4T tokens of publicly available data). The 65B model matched GPT-3.5 on many benchmarks. The release catalyzed an explosion of open-source LLM research.
Llama 2 (2023): Scaled training data to 2T tokens. Added a 34B size variant. Released with a more permissive license. The 70B chat model, fine-tuned with RLHF, became the strongest open-weight instruction-following model at the time.
Llama 3 (2024): A dramatic shift toward inference-optimal training. The 8B model was trained on 15T tokens---nearly 2000 tokens per parameter, far beyond Chinchilla-optimal. The 70B model used 15T tokens as well. A 405B dense model was also released, representing the largest open-weight dense model ever. Used a tokenizer with 128K vocabulary, GQA (Grouped-Query Attention), and a context length of 128K tokens.
Key innovations: Demonstrating that open-weight models can match proprietary ones, pioneering inference-optimal training, and catalyzing the open-source ecosystem (fine-tunes, quantized variants, specialized adaptations).
22.5.4 Mistral and Mixtral (Mistral AI)
Mistral 7B (2023): A 7B-parameter model that outperformed Llama 2 13B on most benchmarks. Used Sliding Window Attention (SWA) to handle longer sequences efficiently and Grouped-Query Attention (GQA) for faster inference.
Mixtral 8x7B (2023): A Mixture of Experts model with 8 experts of 7B parameters each (46.7B total parameters, ~12.9B active per token). Matched or exceeded Llama 2 70B and GPT-3.5 on most benchmarks while being significantly faster at inference. This was a landmark demonstration of MoE's practical viability for LLMs.
Mixtral 8x22B (2024): Scaled the MoE approach further, with 141B total parameters (~39B active). Strong multilingual and code generation capabilities.
Key innovations: Proving MoE's viability at scale, Sliding Window Attention, and demonstrating that small European AI labs can compete with well-resourced American companies.
22.5.5 Gemini (Google DeepMind)
Gemini 1.0 (2023): Three sizes: Ultra, Pro, and Nano. Ultra was Google's most capable model, competitive with GPT-4 on most benchmarks. Natively multimodal---trained from the ground up on interleaved text, image, audio, and video data, rather than bolting vision onto a text model.
Gemini 1.5 Pro (2024): Featured a context window of up to 1 million tokens (later extended to 2 million), the longest of any production LLM. Achieved near-perfect retrieval across the entire context window. Used a MoE architecture for efficiency.
Gemini 2.0 (2024--2025): Focused on agentic capabilities---the ability to use tools, browse the web, and take actions in the real world. Introduced "Deep Research" for autonomous multi-step investigation tasks.
Key innovations: Native multimodality, million-token context windows, and MoE-based efficiency.
22.5.6 Other Notable Models
Several other model families deserve mention for their contributions to the LLM landscape:
Cohere Command R+ (2024): Optimized for retrieval-augmented generation (RAG), with strong citation generation and grounded responses. Demonstrated that specializing for a specific use case can be more valuable than winning general benchmarks.
Qwen 2 (Alibaba, 2024): A family of open-weight models ranging from 0.5B to 72B parameters, with strong multilingual capabilities (particularly Chinese-English). Demonstrated that non-Western labs could produce competitive open-weight models.
DeepSeek-V2 and V3 (2024): Used an innovative multi-head latent attention (MLA) mechanism that compresses key-value representations, dramatically reducing KV cache memory. DeepSeek-V3 used 256 fine-grained experts with top-8 routing plus one always-on shared expert, and its 671B total parameters (37B active) achieved performance competitive with GPT-4o and Claude 3.5 Sonnet. The model was trained on 14.8T tokens with an estimated cost of approximately $5.6 million---a fraction of what comparable Western models cost, challenging assumptions about the compute requirements for frontier models.
Phi-3 (Microsoft, 2024): A family of small language models (3.8B, 7B, 14B) trained on carefully curated "textbook-quality" synthetic data. The Phi-3-mini (3.8B) achieved performance comparable to much larger models on reasoning benchmarks, demonstrating that data quality can partially substitute for model scale.
22.5.7 The Convergence of Architecture
Examining these model families reveals a striking convergence in architectural choices:
| Component | Common Choice | Why |
|---|---|---|
| Position encoding | RoPE | Relative position, length extrapolation |
| Activation function | SwiGLU | Better than GELU/ReLU empirically |
| Normalization | RMSNorm (pre-norm) | Faster than LayerNorm, more stable |
| Attention variant | GQA | Reduces KV cache by 4--8x |
| Vocabulary size | 32K--128K | Balances efficiency and coverage |
| Architecture | MoE (frontier) / Dense (smaller) | MoE for scale, dense for simplicity |
The fact that all major labs independently converge on the same choices suggests these represent genuine optima rather than arbitrary decisions. The competitive advantage now lies primarily in training data (quality, scale, diversity), post-training (instruction tuning, RLHF/DPO, safety), and systems engineering (training infrastructure, serving efficiency), rather than in novel architectures.
22.5.8 Comparative Analysis
| Feature | GPT-4 | Claude 3.5 | Llama 3 405B | Mixtral 8x22B | Gemini 1.5 Pro |
|---|---|---|---|---|---|
| Parameters | ~1.8T (est.) | Undisclosed | 405B | 141B (39B active) | Undisclosed |
| Architecture | MoE (est.) | Dense (est.) | Dense | MoE (8 experts) | MoE |
| Context window | 128K | 200K | 128K | 64K | 1M--2M |
| Open weights | No | No | Yes | Yes | No |
| Multimodal | Text + Vision | Text + Vision | Text (+ Vision) | Text | Text + Vision + Audio |
| Training data | Undisclosed | Undisclosed | 15T tokens | Undisclosed | Undisclosed |
22.6 Benchmarking Large Language Models
22.6.1 Why Benchmarking Is Hard
Evaluating LLMs is fundamentally more challenging than evaluating traditional ML models. The reasons include:
- Task breadth: LLMs are expected to perform well on thousands of diverse tasks.
- Sensitivity to prompting: The same model can score very differently depending on the prompt format, number of few-shot examples, and evaluation protocol.
- Contamination: Training data may overlap with benchmark test sets, inflating scores.
- Saturation: Models approach or reach human-level performance on many benchmarks, reducing their discriminative power.
- Benchmark gaming: Organizations may optimize specifically for popular benchmarks.
22.6.2 Major Benchmarks
MMLU (Massive Multitask Language Understanding). Introduced by Hendrycks et al. (2021), MMLU consists of 57 subjects spanning STEM, humanities, social sciences, and professional domains (law, medicine, accounting). Each subject has multiple-choice questions at various difficulty levels. MMLU has become the de facto standard for assessing broad knowledge, though it is increasingly saturated (top models score >90%).
$$\text{MMLU Score} = \frac{1}{|S|} \sum_{s \in S} \text{Accuracy}_s$$
where $S$ is the set of subjects and $\text{Accuracy}_s$ is the accuracy on subject $s$.
HumanEval. Introduced by Chen et al. (2021) at OpenAI, HumanEval is a benchmark for code generation consisting of 164 Python programming problems with function signatures, docstrings, and unit tests. The metric is pass@k---the probability that at least one of $k$ generated solutions passes all unit tests:
$$\text{pass@}k = \mathbb{E}_{\text{problems}} \left[1 - \frac{\binom{n - c}{k}}{\binom{n}{k}}\right]$$
where $n$ is the total number of samples and $c$ is the number of correct samples.
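A numerically stable way to compute this estimator, equivalent to the binomial expression above but avoiding large binomial coefficients, is sketched below.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n generations is among the c correct ones."""
    if n - c < k:
        return 1.0   # every size-k subset must contain at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 30 pass the unit tests, evaluate pass@10
print(f"pass@10 ~ {pass_at_k(200, 30, 10):.3f}")
```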
GSM8K (Grade School Math 8K). A dataset of 8,500 grade-school-level math word problems requiring multi-step reasoning. Models must produce both the reasoning chain and the final numerical answer. Performance is measured by exact match on the final answer.
MATH. A dataset of 12,500 competition-level mathematics problems covering algebra, geometry, number theory, and more. Significantly harder than GSM8K, requiring sophisticated mathematical reasoning.
HELM (Holistic Evaluation of Language Models). Developed by Stanford's Center for Research on Foundation Models, HELM evaluates models across 42 scenarios covering accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. It provides the most comprehensive single evaluation framework.
BIG-Bench / BIG-Bench Hard (BBH). A collaborative benchmark of 204 tasks designed to probe capabilities beyond those measured by existing benchmarks. BIG-Bench Hard (BBH) consists of the 23 tasks where prior language models failed to exceed average human-rater performance.
MT-Bench. A multi-turn conversational benchmark with 80 high-quality questions across 8 categories. Responses are evaluated by GPT-4 (or another strong model) acting as a judge, producing scores from 1 to 10. This "LLM-as-judge" paradigm has become increasingly popular.
Arena Elo / Chatbot Arena. A crowdsourced evaluation platform (LMSYS) where users chat with two anonymous models side-by-side and vote for the better response. Results are aggregated into Elo ratings, providing a signal that correlates with real-world user preference.
22.6.3 Evaluation Protocols
The choice of evaluation protocol can dramatically affect scores. Key decisions include:
Few-shot vs. zero-shot: Many benchmarks report 5-shot results by default (e.g., MMLU). Switching to zero-shot can change rankings.
Chain-of-thought: Allowing models to reason step-by-step (CoT) dramatically improves performance on math and reasoning tasks. Some benchmarks report both CoT and direct-answer scores.
Constrained generation: For multiple-choice tasks, models can be evaluated by (1) comparing log-probabilities of answer tokens, (2) generating the full answer and parsing it, or (3) using constrained decoding. These methods can yield different results.
Temperature and sampling: Deterministic evaluation (temperature 0 or greedy decoding) is standard for benchmarks, but real-world usage involves sampling with non-zero temperature.
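To illustrate option (1) from the constrained-generation discussion above, the sketch below scores multiple-choice answers by summed token log-probability. The `answer_logprob` callable is a hypothetical stand-in for whatever model or API call returns the log-probability of a continuation given a prompt; it is an assumption here, not part of any specific library. Whether and how to length-normalize is itself a protocol choice that can change rankings.

```python
from typing import Callable, List

def pick_answer(prompt: str,
                choices: List[str],
                answer_logprob: Callable[[str, str], float],
                length_normalize: bool = True) -> int:
    """Return the index of the choice with the highest (optionally length-normalized)
    log-probability of the answer text given the prompt."""
    scores = []
    for choice in choices:
        lp = answer_logprob(prompt, choice)    # sum of token log-probs of `choice` given `prompt`
        if length_normalize:
            lp /= max(len(choice.split()), 1)  # crude word-level normalization; real evals often use token counts
        scores.append(lp)
    return max(range(len(choices)), key=scores.__getitem__)
```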
22.6.4 Benchmark Ecosystems and Leaderboards
The LLM evaluation landscape has coalesced around several leaderboards and ecosystems:
Open LLM Leaderboard (HuggingFace): An automated evaluation platform that runs standardized benchmarks (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K) on any model hosted on the HuggingFace Hub. Its accessibility has made it the de facto ranking system for open-weight models, though it has been criticized for encouraging benchmark overfitting.
LMSYS Chatbot Arena: A crowdsourced platform where human users interact with pairs of anonymous models and vote for the better response. The resulting Elo ratings provide the closest approximation to real-world user preference. As of 2024, Arena Elo has become the most trusted single metric for comparing LLMs, because it resists gaming (models cannot be optimized for anonymous human preferences as easily as for fixed benchmark questions).
SEAL (Safety, Ethics, and Alignment Leaderboard): Uses private, non-public test sets to evaluate safety and alignment properties, resisting contamination by design.
LiveBench: A regularly refreshed benchmark that generates new questions monthly, making contamination through training data nearly impossible.
The proliferation of leaderboards reflects a fundamental tension: no single evaluation can capture the multifaceted nature of LLM capability. A model that tops MMLU may perform poorly on code generation; a model that excels at Chatbot Arena may score modestly on formal reasoning benchmarks. Responsible evaluation requires triangulating across multiple benchmarks and being explicit about what each measures.
22.6.5 The Contamination Problem
Training data contamination occurs when benchmark test examples appear in the model's pre-training data. Because LLMs are trained on web-scale data, and many benchmarks are publicly available online, contamination is difficult to avoid entirely.
Mitigation strategies include:
- Canary strings: Embedding unique strings in test sets to detect if they appear in training data.
- Temporal holdout: Using benchmarks created after the model's training data cutoff.
- Dynamic benchmarks: Generating new test instances programmatically (e.g., new math problems with the same distribution).
- Private test sets: Keeping test examples confidential (e.g., SEAL leaderboard).
22.7 The Instruction-Following Paradigm
22.7.1 From Language Modeling to Task Completion
Raw pre-trained language models are next-token predictors. Given a prompt like "Summarize the following article:", a base model might continue with "This is a common task in NLP..." rather than actually summarizing. The model has no inherent notion that the user's text is an instruction to be followed.
The instruction-following paradigm bridges this gap through a multi-stage post-training pipeline:
- Supervised Fine-Tuning (SFT): Train on a curated dataset of (instruction, response) pairs, teaching the model to interpret natural language instructions and produce appropriate outputs.
- Reinforcement Learning from Human Feedback (RLHF): Train a reward model on human preference data, then use reinforcement learning (typically PPO) to optimize the language model's policy toward higher-reward responses.
- Direct Preference Optimization (DPO): A simplified alternative to RLHF that directly optimizes the language model on preference pairs without a separate reward model.
We will cover RLHF and DPO in detail in Chapter 25. Here, we focus on how instruction tuning transforms a base model's behavior.
22.7.2 The Instruction Tuning Dataset
Instruction tuning datasets typically include:
- Direct instructions: "Write a haiku about machine learning."
- Question answering: "What is the capital of France?"
- Reasoning tasks: "A farmer has 17 sheep. All but 9 run away. How many are left?"
- Coding tasks: "Write a Python function that computes the Fibonacci sequence."
- Multi-turn conversations: A sequence of user-assistant exchanges.
- Refusal examples: Instructions the model should decline (harmful requests).
Key datasets include FLAN (Wei et al., 2022), OpenAssistant Conversations, ShareGPT, and Anthropic's HH-RLHF.
22.7.3 The Effect of Instruction Tuning
Instruction tuning produces several measurable changes:
- Format compliance: The model responds in the expected format (answering questions directly rather than continuing the prompt).
- Helpfulness: Responses are more detailed, structured, and useful.
- Safety: The model learns to refuse harmful requests.
- Reduced toxicity: Toxic language generation decreases substantially.
- Alignment tax: On some pure knowledge benchmarks (e.g., raw MMLU), instruction-tuned models can score slightly lower than base models, because the format-following bias introduces a small overhead. However, on practical tasks, instruction-tuned models are dramatically better.
22.7.4 The Scaling of Instruction-Following Ability
Instruction-following ability itself exhibits scaling behavior, though it is not as clean as pre-training loss scaling:
- Base capability scales smoothly: A model's ability to answer factual questions and perform simple formatting improves gradually with pre-training scale.
- Complex instruction following requires minimum scale: Following multi-constraint instructions (e.g., "Write a poem in iambic pentameter about machine learning that avoids the letter 'e'") requires models above a certain size threshold (roughly 10B+ parameters).
- Post-training amplifies but does not create capabilities: Instruction tuning can surface capabilities that the base model possesses but does not express in the right format. It cannot teach the model facts or reasoning abilities it did not acquire during pre-training.
- Diminishing returns from post-training: The gap between base and instruction-tuned models narrows as base model quality increases. A well-trained 70B base model already responds somewhat appropriately to instructions even without SFT.
22.7.5 System Prompts and Role Design
Modern instruction-following models support a system prompt---a special instruction that defines the model's persona, capabilities, constraints, and behavioral guidelines. The system prompt is typically prepended to every conversation and is treated as higher-priority than user instructions.
Effective system prompts specify:
- Role: "You are a helpful AI assistant specializing in Python programming."
- Behavioral constraints: "Never produce harmful content. If unsure, say so."
- Output format: "Respond in JSON format with fields 'answer' and 'confidence'."
- Knowledge boundaries: "You have access to knowledge up to April 2024."
The effectiveness of system prompts depends heavily on model training. Well-aligned models (e.g., Claude, GPT-4) generally follow system prompts reliably, but adversarial prompting can sometimes override them (a concern we will address in Chapter 23).
22.8 Tokenizer Effects
22.8.1 Why Tokenizers Matter
The tokenizer sits between raw text and the language model, converting character strings into sequences of integer token IDs and back. While often treated as a preprocessing detail, tokenizer design has profound effects on model behavior, efficiency, and capabilities.
A language model with context length $C$ can process $C$ tokens. If the tokenizer is efficient---representing common words and phrases as single tokens---those $C$ tokens encode much more text than if the tokenizer is wasteful. The tokenizer also determines the vocabulary size $V$, which affects the embedding matrix size ($V \times d_{\text{model}}$) and the softmax computation at the output layer.
22.8.2 Byte-Pair Encoding (BPE)
Most modern LLMs use some variant of Byte-Pair Encoding (Sennrich et al., 2016). BPE starts with a base vocabulary of individual bytes (or characters) and iteratively merges the most frequent pair of adjacent tokens into a new token, continuing until the desired vocabulary size is reached.
The algorithm:
1. Initialize the vocabulary with all individual bytes (256 tokens).
2. Count all adjacent token pairs in the training corpus.
3. Merge the most frequent pair into a new token.
4. Repeat steps 2--3 until the target vocabulary size is reached.
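A minimal character-level sketch of the merge loop follows. Real tokenizers operate on bytes, handle pre-tokenization and special tokens, and are far more efficient; this toy version exists only to make the algorithm concrete.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: `corpus` is a list of words; tokens start as single characters."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the corpus.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        # Apply the merge everywhere it occurs.
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges

print(train_bpe(["lower", "lowest", "newer", "wider"], num_merges=5))
```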
GPT-2/GPT-3 use BPE with a vocabulary of 50,257 tokens, trained on English-dominated web text. This tokenizer is efficient for English but inefficient for non-Latin scripts---Chinese characters, for example, often require 2--3 tokens each.
GPT-4 uses cl100k_base with 100,256 tokens, significantly improving efficiency for non-English languages and code.
22.8.3 SentencePiece and Unigram
SentencePiece (Kudo & Richardson, 2018) is a language-independent tokenization library that operates directly on raw text without requiring pre-tokenization (e.g., word splitting). It supports both BPE and Unigram tokenization algorithms.
The Unigram algorithm (Kudo, 2018) takes a fundamentally different approach from BPE. Instead of building up the vocabulary by merging, it starts with a large vocabulary and iteratively removes tokens whose removal least increases the overall training loss. This produces a vocabulary where each token has an associated probability, and tokenization is performed by finding the maximum-probability segmentation using the Viterbi algorithm.
LLaMA models use SentencePiece with BPE, while models like T5 use SentencePiece with Unigram.
22.8.4 Tokenizer Effects on Performance
Multilinguality. A tokenizer trained primarily on English will be inefficient for other languages. The "fertility" of a tokenizer---the average number of tokens per word---varies dramatically across languages. For GPT-2's tokenizer, English text averages ~1.3 tokens per word, but Burmese averages ~15 tokens per word. This means the model's effective context window is roughly 10x shorter for Burmese than for English.
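Fertility is easy to measure for any tokenizer you can run locally. The sketch below uses the `tiktoken` library (assumed to be installed); the exact numbers depend heavily on the text sample, so treat any single measurement as indicative only.

```python
import tiktoken

def fertility(text: str, encoding_name: str = "cl100k_base") -> float:
    """Average number of tokens per whitespace-separated word under a given encoding."""
    enc = tiktoken.get_encoding(encoding_name)
    words = text.split()
    return len(enc.encode(text)) / max(len(words), 1)

sample = "Scaling laws let us predict loss from compute, parameters, and data."
print(f"gpt2 fertility:   {fertility(sample, 'gpt2'):.2f}")
print(f"cl100k fertility: {fertility(sample, 'cl100k_base'):.2f}")
```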
Arithmetic. Tokenizers that split numbers inconsistently (e.g., "1234" might tokenize as ["12", "34"] or ["1", "234"]) make arithmetic harder for the model because the individual digits are not separate tokens.
Code. Code-optimized tokenizers include common programming patterns (e.g., whitespace indentation, function names) as single tokens. The Codex tokenizer, for example, has special tokens for common Python patterns.
Vocabulary size trade-offs. Larger vocabularies improve encoding efficiency (fewer tokens per text) but increase the embedding matrix size and make the output softmax more expensive. They also require more training data to learn good representations for each token. The field has converged on vocabularies in the 32K--128K range as a practical sweet spot.
22.8.5 Tiktoken and Modern Tokenizers
Llama 3 adopted a 128K-token vocabulary with tiktoken-style BPE. This larger vocabulary brought several benefits:
- Higher encoding efficiency across languages (3--4 tokens per word for many languages, vs. 10+ previously).
- Better code handling.
- Reduced sequence length for the same text, effectively increasing the context window.
The tokenizer choice is one of the few architectural decisions that cannot be changed after pre-training---the entire model's vocabulary embedding matrix depends on it. Consequently, tokenizer design deserves careful attention before training begins.
22.8.6 Tokenization and Cost
In the API-based LLM economy, users pay per token. Tokenizer efficiency therefore has direct financial implications. A 100-page document might cost $0.50 to process with an efficient tokenizer but $1.50 with an inefficient one for the same content in a different language. This creates a hidden "language tax" that disproportionately affects non-English users and applications.
Some researchers have proposed returning to byte-level models that process raw bytes without any tokenizer (e.g., MegaByte, MambaByte). These approaches eliminate tokenizer-related disparities but face challenges with very long sequence lengths (a single English word of 5 characters becomes a sequence of 5 bytes, dramatically increasing the effective sequence length).
22.9 Context Window Evolution
22.9.1 The Journey from 512 to Millions
The maximum sequence length a model can process---its context window---has grown exponentially:
| Year | Model | Context Window |
|---|---|---|
| 2017 | Original Transformer | 512 tokens |
| 2018 | GPT-1 | 512 tokens |
| 2019 | GPT-2 | 1,024 tokens |
| 2020 | GPT-3 | 2,048 tokens |
| 2023 | GPT-4 | 8,192 / 32K tokens |
| 2023 | Claude 2 | 100K tokens |
| 2024 | GPT-4 Turbo | 128K tokens |
| 2024 | Claude 3 | 200K tokens |
| 2024 | Gemini 1.5 Pro | 1M--2M tokens |
| 2024 | Llama 3 | 128K tokens |
This 4000x expansion required solving several fundamental challenges.
22.9.2 The Quadratic Attention Bottleneck
Standard self-attention computes pairwise interactions between all tokens, giving $O(L^2)$ time and memory complexity where $L$ is the sequence length. For $L = 2048$, this is manageable. For $L = 1{,}000{,}000$, it requires computing $10^{12}$ attention scores per layer---clearly infeasible without modification.
22.9.3 Position Encoding for Long Contexts
The original Transformer used sinusoidal positional encodings, which are defined for any position but provide no length generalization guarantees. Learned positional embeddings (GPT-1/2/3) are limited to the training-time maximum length.
Rotary Position Embeddings (RoPE) (Su et al., 2021) encode position by rotating the query and key vectors in 2D subspaces. The rotation angle for position $m$ and dimension $i$ is:
$$\theta_{m,i} = m \cdot \theta_{\text{base}}^{-2i/d}$$
where $\theta_{\text{base}}$ is typically 10,000. RoPE has two key properties: (1) the attention score between tokens depends only on their relative position, and (2) it can be extended to longer sequences than seen during training by adjusting $\theta_{\text{base}}$ (the "NTK-aware" scaling method).
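The rotation itself is straightforward to implement. The sketch below applies RoPE to a matrix of query or key vectors using one common convention for pairing dimensions; implementations differ in whether they pair adjacent dimensions (as here) or split the vector into two halves.

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, theta_base: float = 10000.0) -> np.ndarray:
    """Rotate query/key vectors x of shape (seq_len, d) by position-dependent angles.
    Dimension pair (2i, 2i+1) at position m is rotated by m * theta_base**(-2i/d)."""
    seq_len, d = x.shape
    i = np.arange(d // 2)
    freqs = theta_base ** (-2.0 * i / d)              # per-pair base frequencies
    angles = positions[:, None] * freqs[None, :]      # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

q = np.random.randn(8, 64)          # 8 positions, head dimension 64
q_rot = apply_rope(q, np.arange(8))
```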
ALiBi (Attention with Linear Biases) (Press et al., 2022) takes an even simpler approach: it adds a linear bias to the attention score based on the distance between query and key positions:
$$\text{out}_i = \sum_{j} \text{softmax}_j\left(\frac{q_i^\top k_j}{\sqrt{d_k}} - m \cdot |i - j|\right) v_j$$
where $m$ is a head-specific slope. ALiBi requires no learned positional parameters and generalizes well to longer sequences, but has been largely superseded by RoPE in practice.
22.9.4 Efficient Attention Mechanisms
FlashAttention (Dao et al., 2022) does not change the mathematical operation---it computes exact standard attention---but restructures the computation to minimize memory reads/writes (I/O). By tiling the attention computation and keeping intermediate results in fast SRAM rather than slow HBM, FlashAttention achieves 2--4x speedups and reduces memory from $O(L^2)$ to $O(L)$. FlashAttention-2 and FlashAttention-3 further optimize for modern GPU architectures.
Ring Attention (Liu et al., 2023) distributes the attention computation across multiple GPUs by having each device compute attention with its local key-value block, then passing key-value blocks around in a ring topology. This enables context lengths that scale linearly with the number of GPUs.
Sliding Window Attention (Beltagy et al., 2020; Mistral, 2023) limits each token to attending only to the previous $W$ tokens (e.g., $W = 4096$). While this prevents direct attention to distant tokens, information can propagate through the network across multiple layers: with $n_{\text{layers}}$ layers and window $W$, the effective receptive field is $n_{\text{layers}} \times W$.
22.9.5 Context Window and Retrieval Quality
A larger context window does not automatically mean better performance on long-context tasks. Liu et al. (2023) demonstrated the "Lost in the Middle" phenomenon: models perform well on information at the beginning and end of the context but poorly on information in the middle. This U-shaped retrieval curve suggests that attention mechanisms do not distribute focus uniformly across long contexts.
Addressing this requires:
- Position interpolation: Extending RoPE to longer sequences via interpolation rather than extrapolation.
- Long-context fine-tuning: Specifically training on tasks that require attending to the middle of long documents.
- Retrieval augmentation: Combining long context windows with explicit retrieval mechanisms (RAG, Chapter 31).
22.10 Mixture of Experts (MoE)
22.10.1 Motivation: Decoupling Parameters from Compute
Dense Transformer models use all parameters for every input token: a 70B-parameter model applies all 70 billion parameters' worth of computation to each token it processes. Mixture of Experts (MoE) breaks this coupling: a model can have 1 trillion total parameters but activate only 100 billion of them for each token, approaching the quality associated with the large parameter count at roughly the per-token compute cost of the smaller one.
The key insight is that different tokens may benefit from different "expert" subnetworks. A token about mathematics might activate experts specializing in logical reasoning, while a token about poetry might activate different experts. The model learns to route tokens to appropriate experts automatically.
22.10.2 Architecture
A standard MoE Transformer replaces some or all of the feed-forward network (FFN) blocks with MoE layers. Each MoE layer consists of:
- $E$ expert networks: Each expert is a standard FFN (typically two linear layers with a nonlinearity). All experts have the same architecture but different parameters.
- A routing function (gating network): A learned function that maps each token's hidden state to a probability distribution over experts:
$$G(x) = \text{softmax}(W_g \cdot x + \text{noise})$$
where $W_g \in \mathbb{R}^{E \times d_{\text{model}}}$ is the gating weight matrix and "noise" is optional noise added during training for exploration.
- Top-k selection: Only the top $k$ experts (typically $k = 1$ or $k = 2$) are activated for each token. The output is a weighted sum of the selected experts' outputs:
$$y = \sum_{i \in \text{Top-}k(G(x))} G(x)_i \cdot \text{Expert}_i(x)$$
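A minimal sketch of a single-token forward pass through such a layer, ignoring batching, capacity limits, and the auxiliary loss, might look like the following. Some implementations, such as Mixtral's, renormalize the gate weights over the selected top-$k$ experts; the version here follows the formula above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x, W_g, experts, k=2):
    """Single-token top-k MoE layer.
    x: (d,) hidden state; W_g: (E, d) gating matrix; experts: list of callables (FFNs)."""
    gate = softmax(W_g @ x)                    # G(x): probability over E experts
    top = np.argsort(gate)[-k:]                # indices of the k highest-scoring experts
    # Weighted sum of the selected experts' outputs, weighted by their gate values.
    return sum(gate[i] * experts[i](x) for i in top)

# Toy example: 4 experts, each a random linear map standing in for an FFN.
d, E = 16, 4
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.standard_normal((d, d)): W @ x for _ in range(E)]
W_g = rng.standard_normal((E, d))
y = moe_forward(rng.standard_normal(d), W_g, experts, k=2)
```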
22.10.3 The Load Balancing Challenge
A naive routing function will often collapse to using only a few experts, wasting the capacity of the others. This "expert collapse" problem is addressed through auxiliary losses that encourage balanced expert utilization.
The standard load balancing loss (Fedus et al., 2021) penalizes uneven expert usage:
$$\mathcal{L}_{\text{balance}} = \alpha \cdot E \cdot \sum_{i=1}^{E} f_i \cdot P_i$$
where $f_i$ is the fraction of tokens routed to expert $i$, $P_i$ is the average routing probability for expert $i$, and $\alpha$ is a hyperparameter (typically 0.01). This loss is minimized when all experts receive equal traffic ($f_i = 1/E$) and equal probability ($P_i = 1/E$).
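In code, with top-1 routing for simplicity, the quantities $f_i$ and $P_i$ and the auxiliary loss can be computed as in this sketch:

```python
import numpy as np

def load_balancing_loss(router_probs: np.ndarray, alpha: float = 0.01) -> float:
    """Switch-Transformer-style auxiliary loss.
    router_probs: (num_tokens, num_experts) softmax outputs of the gating network."""
    num_tokens, num_experts = router_probs.shape
    assignments = router_probs.argmax(axis=1)                          # top-1 expert per token
    f = np.bincount(assignments, minlength=num_experts) / num_tokens   # fraction routed to each expert
    P = router_probs.mean(axis=0)                                      # mean routing probability per expert
    return alpha * num_experts * float(np.sum(f * P))
```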
Expert capacity. An additional mechanism is to set a maximum capacity per expert: each expert can process at most $C_{\text{cap}}$ tokens per batch. Tokens that exceed the capacity are either dropped (assigned a zero output) or overflowed to a shared "residual" expert.
22.10.4 Historical Context and Modern MoE
The MoE concept predates modern deep learning. Jacobs et al. (1991) proposed Mixtures of Experts with an EM-style training algorithm. Shazeer et al. (2017) adapted MoE for deep learning, introducing sparsely-gated mixture-of-experts layers within LSTMs.
The modern MoE renaissance was driven by two models:
Switch Transformer (Fedus et al., 2021): Used top-1 routing (each token goes to exactly one expert) with a simplified load balancing loss. Achieved 4--7x speedups over dense T5 models at the same compute budget. Demonstrated scaling to over 1 trillion parameters.
GShard (Lepikhin et al., 2021): Scaled MoE to 600B parameters for machine translation, using top-2 routing and expert parallelism across thousands of TPU cores.
22.10.5 MoE in Modern LLMs
Several leading LLMs use MoE architectures:
Mixtral 8x7B: 8 experts per MoE layer, top-2 routing. 46.7B total parameters, 12.9B active per token. Every FFN layer is replaced with an MoE layer.
Mixtral 8x22B: Same structure at larger scale. 141B total, ~39B active.
GPT-4: Widely reported (though not officially confirmed) to use a MoE architecture with approximately 16 experts.
Gemini 1.5 Pro: Confirmed to use MoE, which is key to enabling the million-token context window at reasonable cost.
DeepSeek-V2 and V3: Used MoE with innovative fine-grained expert design. DeepSeek-V3 (2024) used 256 fine-grained experts with top-8 routing, plus 1 shared expert that always activates. This achieved strong performance with only 37B active parameters out of 671B total.
22.10.6 MoE Trade-offs
Advantages:
- Higher quality for the same compute budget (or equal quality at lower compute).
- Enables scaling to very large parameter counts without proportional compute increase.
- Experts can specialize, potentially improving performance on diverse tasks.

Disadvantages:
- Memory: All expert parameters must be stored in memory, even though only a fraction are used per token. A 1T-parameter MoE model requires 1T parameters' worth of memory.
- Communication overhead: In distributed training, tokens must be routed to the correct expert, which may reside on a different GPU. This all-to-all communication can be a bottleneck.
- Training instability: MoE models are more prone to training instabilities, requiring careful hyperparameter tuning.
- Inference complexity: Batch processing is complicated because different tokens in the same batch may route to different experts, reducing GPU utilization.
- Expert collapse: Without proper regularization, the model may underutilize many experts.
22.10.7 The Mathematics of MoE Scaling
From a scaling law perspective, MoE models follow different curves than dense models. If a dense model has scaling exponent $\alpha$, an MoE model with sparsity ratio $s$ (fraction of parameters active per token) empirically achieves a scaling exponent close to $\alpha$ but with a shifted constant:
$$L_{\text{MoE}}(N_{\text{total}}, s) \approx L_{\text{dense}}(N_{\text{total}} \cdot s^{-\gamma})$$
where $\gamma < 1$ reflects the fact that inactive parameters still contribute some value (they were trained on different token subsets). In practice, MoE models achieve roughly the same quality as a dense model 2--4x their active parameter count, suggesting $\gamma \approx 0.5$--$0.7$.
22.11 Putting It All Together: The Modern LLM Pipeline
22.11.1 Pre-training
The pre-training pipeline for a modern LLM involves:
- Data collection: Web scraping (CommonCrawl, etc.), books, academic papers, code repositories, multilingual text.
- Data filtering: Deduplication, quality filtering (perplexity-based, classifier-based), toxicity removal, PII scrubbing.
- Tokenizer training: BPE or Unigram on a representative sample, typically 32K--128K vocabulary.
- Architecture selection: Dense or MoE, number of layers, hidden dimension, number of heads, activation function (SwiGLU is standard), normalization (RMSNorm), position encoding (RoPE).
- Training: Mixed-precision (bf16), distributed across thousands of GPUs, using 3D parallelism (data, tensor, pipeline). Typical duration: weeks to months.
22.11.2 Post-training
Post-training transforms the base model into an instruction-following assistant:
- Supervised Fine-Tuning (SFT): Thousands to millions of instruction-response pairs.
- Preference optimization: RLHF (PPO) or DPO on human preference data.
- Safety training: Red-teaming, Constitutional AI, targeted safety fine-tuning.
- Tool use training: Teaching the model to call functions, search the web, execute code.
- Evaluation: MMLU, HumanEval, GSM8K, internal benchmarks, human evaluation.
22.11.3 Inference Optimization
Serving a large LLM efficiently requires:
- Quantization: Reducing precision from fp16/bf16 to int8 or int4 (Chapter 33).
- KV cache management: Efficiently managing the key-value cache for long sequences (PagedAttention, vLLM).
- Speculative decoding: Using a small draft model to propose multiple tokens, verified in parallel by the large model.
- Continuous batching: Dynamically adding new requests to running batches.
- Model parallelism: Distributing the model across multiple GPUs for inference.
22.12 The Cost of Scale
Before summarizing, it is worth reflecting on the economic and environmental dimensions of LLM scaling. Training GPT-4 reportedly cost over $100 million in compute. Training a model like Llama 3 405B on 15 trillion tokens across 16,000 H100 GPUs consumes enormous amounts of electricity and generates a significant carbon footprint.
These costs create barriers to entry that concentrate AI development in a small number of well-funded organizations. Open-weight models (Llama, Mistral, DeepSeek) partially democratize access to frontier capabilities, but the ability to train such models from scratch remains restricted.
The scaling laws tell us that continued progress through scaling alone will require exponentially growing budgets: with $\alpha_C \approx 0.05$, even halving the loss along the compute scaling curve requires roughly $2^{1/0.05} \approx 10^6$ times more compute. This creates pressure for efficiency innovations---better architectures, better data, better training algorithms---that can shift the scaling curve rather than simply moving along it. MoE, FlashAttention, improved tokenizers, and data quality filtering are all examples of innovations that provide "free" scaling by improving the constant factors.
The tension between scaling and efficiency is likely to define the next decade of AI development. Engineers who understand both the power and the limitations of scaling will be best positioned to make principled decisions about resource allocation.
22.13 Summary
This chapter has traced the science and engineering of scaling language models from early empirical observations to the sophisticated models deployed today. We began with the neural scaling laws discovered by Kaplan et al. and the critical Chinchilla revision that showed models were being drastically undertrained. We examined the contentious concept of emergent abilities, surveyed the major LLM families and their architectural innovations, and studied the benchmarking methodology needed to evaluate these models rigorously.
We then explored three technical dimensions that profoundly affect LLM capabilities: tokenizer design, which determines how efficiently text is encoded; context window evolution, which has expanded from 512 to millions of tokens through positional encoding innovations and efficient attention mechanisms; and Mixture of Experts, which decouples parameter count from compute cost and has become the dominant architecture for frontier models.
The key takeaways are:
- Scaling is predictable: Cross-entropy loss follows smooth power laws with model size, data size, and compute. These laws enable principled resource allocation.
- Chinchilla changed the game: The optimal ratio is roughly 20 tokens per parameter, and inference-optimal ratios can be even higher.
- Emergence is contested: Whether scale produces genuinely discontinuous capability jumps depends on how you measure. Use continuous metrics for reliable predictions.
- Architecture converges: Despite competitive pressures, all major LLM families converge on similar architectural choices: RoPE, SwiGLU, RMSNorm, GQA.
- MoE is the future of scale: Decoupling parameters from compute enables models that would be prohibitively expensive as dense architectures.
- The full pipeline matters: Pre-training is necessary but not sufficient. Post-training (instruction tuning, alignment) is what makes models useful.
In Chapter 23, we will explore prompt engineering and in-context learning---the practical skills for getting the most out of these powerful models.
References
- Brown, T. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020.
- Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374.
- Dao, T. et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022.
- Fedus, W. et al. (2021). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.
- Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021.
- Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022.
- Jacobs, R. et al. (1991). "Adaptive Mixtures of Local Experts." Neural Computation.
- Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
- Kudo, T. & Richardson, J. (2018). "SentencePiece: A Simple and Language Independent Subword Tokenizer." EMNLP 2018.
- Lepikhin, D. et al. (2021). "GShard: Scaling Giant Models with Conditional Computation." ICLR 2021.
- Liu, N. et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172.
- Press, O. et al. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR 2022.
- Schaeffer, R. et al. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS 2023.
- Sennrich, R. et al. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016.
- Shazeer, N. et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017.
- Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864.
- Touvron, H. et al. (2023a). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
- Touvron, H. et al. (2023b). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288.
- Wei, J. et al. (2022). "Emergent Abilities of Large Language Models." TMLR 2022.