Chapter 22: Key Takeaways

Scaling Laws

  1. Power-law scaling is the central empirical fact of modern AI. Language model loss follows smooth power-law relationships with model size ($N$), dataset size ($D$), and compute ($C$). These relationships hold over many orders of magnitude and appear across architectures and data domains.

  2. Kaplan et al. established the framework; Chinchilla corrected the allocation. The original Kaplan scaling laws (2020) recommended investing most additional compute into model size. Hoffmann et al. (2022) showed this was wrong: the optimal allocation scales model size and training data equally ($N_{\text{opt}} \propto C^{0.50}$, $D_{\text{opt}} \propto C^{0.50}$), with a ratio of approximately 20 tokens per parameter (the allocation arithmetic is sketched after this list).

  3. Many production models are trained beyond Chinchilla-optimal. The Chinchilla-optimal point minimizes training compute for a target loss, but inference cost depends on model size. A smaller model trained on more data than the 20:1 ratio prescribes (the "inference-optimal" or "overtrained" strategy) spends extra training compute but saves money over the model's lifetime by reducing per-query cost.

  4. Data is now the bottleneck. When models need 20+ tokens per parameter, data scarcity becomes a first-order concern. This has motivated research into synthetic data generation, data filtering, and multi-epoch training.
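
A minimal sketch of the allocation arithmetic referenced above, assuming the common approximation $C \approx 6ND$ training FLOPs and the roughly 20-tokens-per-parameter Chinchilla ratio; the example budget is illustrative:

```python
def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training-compute budget between parameters (N) and tokens (D).

    Assumes the common approximation C ~= 6 * N * D training FLOPs and the
    roughly 20-tokens-per-parameter ratio reported by Hoffmann et al. (2022).
    """
    # With D = r * N and C = 6 * N * D, solving for N gives N = sqrt(C / (6 * r)).
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Example: a 1e24-FLOP budget lands near ~90B parameters and ~1.8T tokens.
n, d = chinchilla_allocation(1e24)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")
```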

Emergent Abilities

  1. Emergent abilities are real in effect but debated in mechanism. Larger models can perform tasks that smaller models cannot. Whether this represents a genuine phase transition or is an artifact of discontinuous metrics (exact-match accuracy) remains an open question.

  2. Metric choice determines whether you see emergence. Exact-match metrics can produce apparent phase transitions even when the underlying capability improves smoothly (a toy illustration follows this list). Use continuous metrics (edit distance, partial credit) for reliable scaling predictions.

  3. You cannot reliably extrapolate discrete-task performance from smaller models. A capability that is absent at 7B parameters may be present at 70B. The only way to know is to test at the target scale.
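
A toy simulation of the metric-choice argument: per-token accuracy is assumed to improve smoothly with scale (the curve below is invented purely for illustration), yet exact match on a ten-token answer looks like a sudden jump:

```python
import math

# Toy illustration: a smoothly improving per-token accuracy can look
# "emergent" under an exact-match metric. Exact match on a k-token answer
# is roughly p**k, which stays near zero and then jumps upward even though
# p itself rises gradually. All numbers below are invented for illustration.
model_sizes_b = [0.1, 0.3, 1, 3, 10, 30, 100, 300]   # model size in billions of params
answer_length = 10                                    # tokens in the target answer

for size in model_sizes_b:
    # Hypothetical smooth improvement of per-token accuracy with log(model size).
    per_token_acc = min(0.5 + 0.49 * (math.log10(size) + 1) / 3.5, 0.99)
    exact_match = per_token_acc ** answer_length
    print(f"{size:6.1f}B  per-token acc = {per_token_acc:.2f}  exact match = {exact_match:.3f}")
```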

LLM Families

  1. Architecture has largely converged. Despite competitive pressures, all major LLM families (GPT-4, Claude, Llama, Mistral, Gemini) use similar building blocks: RoPE position encoding, SwiGLU activation, RMSNorm, and grouped-query attention (GQA); two of these are sketched after this list. The differentiators are training data, post-training, and infrastructure.

  2. Open-weight models have reached parity with proprietary models on many tasks. Llama 3 405B matches or approaches GPT-4 on numerous benchmarks, demonstrating that architecture and training methodology are not secrets---scale and data are.

  3. MoE has become the dominant architecture for frontier models. GPT-4, Gemini, Mixtral, and DeepSeek all use Mixture of Experts, enabling trillion-parameter models at manageable inference cost.
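
For orientation, a minimal PyTorch sketch of two of these shared building blocks, RMSNorm and a SwiGLU feed-forward layer; the dimensions are illustrative and real implementations differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square LayerNorm: rescale by the RMS, no mean subtraction."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLU(nn.Module):
    """Gated feed-forward block: (SiLU(x W_gate) * x W_up) W_down."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


x = torch.randn(2, 16, 512)                 # (batch, sequence, model dim)
x = SwiGLU(512, 1408)(RMSNorm(512)(x))      # hidden dim roughly 8/3 * dim, rounded
print(x.shape)                              # torch.Size([2, 16, 512])
```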

Benchmarking

  1. Benchmarking LLMs is fundamentally harder than benchmarking traditional ML models. Sensitivity to prompt format, few-shot examples, and evaluation protocol means that scores are only meaningful within a precisely defined evaluation context.

  2. No single benchmark tells the full story. MMLU measures knowledge, HumanEval measures coding, GSM8K measures math reasoning, and Arena Elo measures human preference. A comprehensive evaluation requires all of these and more.

  3. Contamination is a persistent threat. Training data may overlap with test sets, inflating scores. Use temporal holdout, private test sets, and dynamic benchmarks to mitigate this (a simple overlap check is sketched after this list).

  4. LLM-as-judge and human evaluation complement automated metrics. MT-Bench and Chatbot Arena capture aspects of quality (helpfulness, clarity, engagement) that automated metrics miss.
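
A minimal sketch of one common contamination heuristic, word n-gram overlap between test examples and training text; the n-gram length and the toy corpora are illustrative:

```python
def ngrams(text: str, n: int = 8):
    """Set of lowercase word n-grams in a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(test_examples, training_ngrams, n: int = 8):
    """Fraction of test examples sharing at least one n-gram with the training data."""
    flagged = sum(1 for example in test_examples if ngrams(example, n) & training_ngrams)
    return flagged / max(len(test_examples), 1)


# Illustrative usage with tiny stand-in corpora.
train_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
test_set = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "an entirely different benchmark question about photosynthesis",
]
train_ngrams = set().union(*(ngrams(d) for d in train_docs))
print(contamination_rate(test_set, train_ngrams))  # 0.5
```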

Instruction Following

  1. The instruction-following paradigm is what makes LLMs usable. Raw pre-trained models are next-token predictors, not assistants. Supervised fine-tuning, RLHF, and DPO transform them into instruction-following systems.

  2. System prompts provide application-level control. A well-designed system prompt defines the model's role, behavioral constraints, and output format, enabling a single model to serve many different applications, as sketched below.
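
A minimal sketch of a system prompt in the widely used role/content chat-message format; the application name, constraints, and JSON schema below are invented for illustration:

```python
# A system prompt scopes one general-purpose model to one application.
messages = [
    {
        "role": "system",
        "content": (
            "You are a support assistant for Acme's billing product. "        # role
            "Only answer billing questions; politely refuse anything else. "  # constraints
            "Respond as JSON with keys 'answer' and 'confidence'."            # output format
        ),
    },
    {"role": "user", "content": "Why was I charged twice this month?"},
]
# `messages` is then passed to whichever chat-completion API is in use;
# swapping the system message re-targets the same model to a different application.
```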

Tokenizers

  1. Tokenizer design has cascading effects on model capability. Tokenizer fertility (the average number of tokens produced per word) affects effective context length, multilinguality, arithmetic ability, and code generation quality; a simple fertility measurement is sketched after this list. A tokenizer trained primarily on English penalizes other languages by consuming more tokens per word.

  2. Larger vocabularies improve efficiency but increase embedding overhead. The field has converged on 32K--128K vocabulary sizes as a practical sweet spot. Tokenizer choice is irreversible after pre-training.
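
A minimal sketch of measuring fertility with the tiktoken library and its cl100k_base encoding; the sample sentences are illustrative, and whitespace word counts are only a rough proxy for languages written without spaces:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The committee approved the proposal after a short discussion.",
    "German": "Der Ausschuss genehmigte den Vorschlag nach einer kurzen Diskussion.",
    "Finnish": "Valiokunta hyväksyi ehdotuksen lyhyen keskustelun jälkeen.",
}

for lang, text in samples.items():
    tokens = enc.encode(text)
    fertility = len(tokens) / len(text.split())  # tokens per whitespace-delimited word
    print(f"{lang:8s} {len(tokens):3d} tokens  fertility = {fertility:.2f}")
```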

Context Windows

  1. Context windows have grown roughly 4,000x in seven years, from 512 tokens (2017) to 2 million tokens (2024), driven by RoPE, FlashAttention, Ring Attention, and sliding-window attention.

  2. Longer context does not automatically mean better understanding. The "Lost in the Middle" phenomenon shows that models attend poorly to information in the middle of long contexts; a simple probe for this is sketched below. Targeted training and architectural innovations are needed to fully exploit long windows.
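
A minimal sketch of a needle-at-varying-depth probe for this effect; `ask_model` is a hypothetical placeholder for whatever inference call is available, and the filler and needle text are invented:

```python
def build_context(needle: str, filler_sentence: str, n_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of a long context."""
    sentences = [filler_sentence] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)


def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an actual LLM inference call."""
    raise NotImplementedError


needle = "The secret code for the vault is 4921."
filler = "The museum's east wing is closed for renovations until further notice."
question = "What is the secret code for the vault?"

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    context = build_context(needle, filler, n_sentences=2000, depth=depth)
    answer = ask_model(f"{context}\n\nQuestion: {question}")
    # Retrieval accuracy typically dips when the needle sits at middle depths.
    print(depth, "4921" in answer)
```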

Mixture of Experts

  1. MoE decouples parameter count from per-token compute. A model can have 1T total parameters but activate only 100B per token, capturing much of the quality benefit of the larger parameter count at roughly the per-token compute cost of the smaller one.

  2. Load balancing is the key challenge in MoE training. Without auxiliary losses, the router collapses to sending most tokens to a few experts, wasting capacity. A load-balancing loss encourages uniform expert utilization (a routing sketch follows this list).

  3. MoE trades compute savings for memory overhead. All expert parameters must be stored in memory even though only a fraction are used per token. This makes MoE memory-intensive but compute-efficient.

  4. Modern MoE designs use fine-grained experts. DeepSeek-V3 uses 256 small experts with top-8 routing plus a shared expert, achieving better specialization than the 8-expert designs of earlier models.
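
A minimal PyTorch sketch of top-k routing with a Switch-Transformer-style load-balancing auxiliary loss; the token count, expert count, and top-k value are illustrative, and the coefficient that folds the auxiliary loss into the total training loss is left to the caller:

```python
import torch
import torch.nn.functional as F


def route_tokens(x, router_weight, num_experts: int, top_k: int = 2):
    """Top-k routing with a Switch-style load-balancing auxiliary loss.

    x: (num_tokens, dim), router_weight: (dim, num_experts).
    Returns chosen expert indices, routing weights, and the auxiliary loss.
    """
    logits = x @ router_weight                        # (tokens, experts)
    probs = F.softmax(logits, dim=-1)
    top_probs, top_idx = probs.topk(top_k, dim=-1)    # chosen experts per token
    top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)  # renormalize weights

    # Load-balancing loss: fraction of routed slots per expert (f) times the
    # mean router probability per expert (P), summed and scaled by num_experts.
    # It is ~1 when routing is near-uniform and grows as routing collapses.
    slots_per_expert = F.one_hot(top_idx, num_experts).float().sum(dim=(0, 1))
    f = slots_per_expert / top_idx.numel()
    P = probs.mean(dim=0)
    aux_loss = num_experts * (f * P).sum()
    # In training, aux_loss is added to the LM loss scaled by a small coefficient (e.g. ~0.01).

    return top_idx, top_probs, aux_loss


x = torch.randn(128, 64)              # 128 tokens, hidden dim 64
router = torch.randn(64, 8) * 0.02    # router weights for 8 experts
idx, w, aux = route_tokens(x, router, num_experts=8)
print(idx.shape, w.shape, float(aux))  # torch.Size([128, 2]) torch.Size([128, 2]) ~1.0
```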

The Big Picture

  1. The modern LLM pipeline is pre-training + post-training + inference optimization. Each stage has its own scaling laws, design decisions, and trade-offs. Understanding the full pipeline is essential for AI engineers building production systems.