Chapter 22: Further Reading

Foundational Papers on Scaling Laws

  • Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361. The original paper establishing power-law scaling relationships between model loss and model size, dataset size, and compute. Essential reading for understanding the empirical foundation of modern LLM development.

  • Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022. (The Chinchilla paper.) Revises the Kaplan scaling laws and establishes the ~20 tokens-per-parameter rule for compute-optimal training; a short worked example of the rule follows this list. One of the most impactful papers in recent AI history.

  • Henighan, T. et al. (2020). "Scaling Laws for Autoregressive Generative Modeling." arXiv:2010.14701. Extends scaling law analysis beyond text to images, video, math, and code, showing that the same power-law form holds across modalities.

  • Muennighoff, N. et al. (2023). "Scaling Data-Constrained Language Models." arXiv:2305.16264. Investigates what happens when training data is limited and models must train for multiple epochs. Establishes scaling laws for the repeated-data regime.

  • Sardana, N. et al. (2023). "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." arXiv:2401.00448. Extends scaling analysis to account for inference cost over a model's lifetime, providing the theoretical basis for inference-optimal training (e.g., Llama 3).
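
A quick way to internalize the Chinchilla rule of thumb is to combine it with the common $C \approx 6ND$ approximation for training FLOPs ($N$ parameters, $D$ tokens). The sketch below is illustrative only: the fitted coefficients in Hoffmann et al. differ slightly from these round numbers, and the helper name is ours.

```python
# Back-of-the-envelope sketch: Chinchilla-style compute-optimal token budgets.
# Assumes D ~ 20*N (tokens per parameter) and C ~ 6*N*D training FLOPs;
# both are rough rules of thumb, not the exact fitted constants from the paper.

def chinchilla_estimate(n_params: float, tokens_per_param: float = 20.0):
    """Return (compute-optimal training tokens, approximate training FLOPs)."""
    d_tokens = tokens_per_param * n_params   # compute-optimal token count
    c_flops = 6.0 * n_params * d_tokens      # approximate training compute
    return d_tokens, c_flops

for n in (7e9, 70e9, 400e9):
    d, c = chinchilla_estimate(n)
    print(f"{n/1e9:>4.0f}B params -> ~{d/1e12:.2f}T tokens, ~{c:.1e} FLOPs")
```

For reference, Chinchilla itself was a 70B-parameter model trained on roughly 1.4T tokens, matching the middle row; Llama 3 8B's 15T tokens (see the Llama 3 entry below) sits nearly two orders of magnitude past this "optimal" point, which is exactly the inference-aware trade-off the Sardana et al. entry addresses.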

Emergent Abilities

  • Wei, J. et al. (2022). "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research (TMLR). The original paper defining and cataloging emergent abilities. Provides extensive evidence for sharp capability transitions at scale.

  • Schaeffer, R., Miranda, B., & Koyejo, S. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS 2023. A rigorous critique arguing that much of the apparent emergence is an artifact of the nonlinear or discontinuous metrics used to measure performance rather than of the models themselves. Essential reading for developing a balanced view.

  • Srivastava, A. et al. (2023). "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models." TMLR 2023. (BIG-Bench.) A collaborative benchmark with 204 tasks designed to probe capabilities beyond existing benchmarks. Includes extensive analysis of scaling behavior.

Major LLM Families

  • Brown, T. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. (GPT-3.) Demonstrates that sufficiently large language models can perform tasks via few-shot in-context learning without fine-tuning.

  • OpenAI (2023). "GPT-4 Technical Report." arXiv:2303.08774. Notable both for its results and for what it omits: architecture details, training data, and compute budget are not disclosed, marking a shift toward secrecy.

  • Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. The paper that launched the open-source LLM revolution. Demonstrates that competitive models can be trained on publicly available data.

  • Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Extends LLaMA with more data, a permissive license, and RLHF-trained chat models.

  • Meta AI (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. Details the inference-optimal training strategy (8B model on 15T tokens) and the 405B dense model.

  • Jiang, A. et al. (2023). "Mistral 7B." arXiv:2310.06825. Applies sliding window attention and grouped-query attention, and demonstrates that a carefully trained 7B model can outperform larger competitors such as Llama 2 13B.

  • Jiang, A. et al. (2024). "Mixtral of Experts." arXiv:2401.04088. The landmark paper demonstrating practical MoE at the LLM scale. Mixtral 8x7B matches Llama 2 70B at much lower inference cost.

  • Google DeepMind (2024). "Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context." arXiv:2403.05530. Details the million-token context window and MoE architecture that enables it.

  • Anthropic (2024). "The Claude 3 Model Family: Opus, Sonnet, Haiku." Anthropic model card. Describes the Haiku-Sonnet-Opus model family and Anthropic's approach to safety and capability.

  • DeepSeek AI (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. Describes a 671B-parameter MoE model with 37B parameters active per token, 256 fine-grained routed experts per layer, and multi-head latent attention (introduced in DeepSeek-V2).

Benchmarking and Evaluation

  • Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021. (MMLU.) The de facto standard benchmark for broad knowledge evaluation. Covers 57 academic subjects.

  • Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. (HumanEval.) Introduces the pass@k metric and the HumanEval benchmark for code generation; a sketch of the pass@k estimator follows this list.

  • Cobbe, K. et al. (2021). "Training Verifiers to Solve Math Word Problems." arXiv:2110.14168. (GSM8K.) A benchmark of grade-school math problems requiring multi-step reasoning.

  • Liang, P. et al. (2023). "Holistic Evaluation of Language Models." Transactions on Machine Learning Research (TMLR). (HELM.) One of the most comprehensive evaluation frameworks, assessing models on accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency across 42 scenarios.

  • Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. Introduces MT-Bench and the Chatbot Arena platform for human-preference-based evaluation with Elo ratings.

  • Liu, N. et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172. Demonstrates that long-context models retrieve information placed at the beginning or end of the context far better than information in the middle (a U-shaped curve), a critical finding for practical long-context usage.
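
The pass@k metric from the HumanEval entry above is normally computed with the unbiased estimator given in Chen et al. (2021): generate $n$ samples per problem, count the $c$ that pass the unit tests, and estimate $\mathrm{pass@}k = 1 - \binom{n-c}{k}/\binom{n}{k}$, averaged over problems. A minimal sketch of the per-problem estimator, using the numerically stable product form (the function name is ours):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimator of pass@k.
    n = samples generated, c = samples that passed, k = evaluation budget.
    Equals 1 - C(n-c, k) / C(n, k), computed as a stable running product."""
    if n - c < k:
        return 1.0  # too few failures for a size-k draw to contain no passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for one problem, 13 of which pass the tests.
print(pass_at_k(n=200, c=13, k=1))   # ~0.065
print(pass_at_k(n=200, c=13, k=10))  # substantially higher
```

The benchmark score is the mean of this quantity over all problems in the suite.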

Tokenization

  • Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. The original BPE paper for NLP, which remains the basis for most LLM tokenizers; a minimal sketch of the BPE merge loop follows this list.

  • Kudo, T. (2018). "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates." ACL 2018. Introduces the Unigram tokenization algorithm and subword regularization.

  • Kudo, T. & Richardson, J. (2018). "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing." EMNLP 2018. The SentencePiece library used by Llama and many other models.

  • Petrov, A. et al. (2023). "Language Model Tokenizers Introduce Unfairness Between Languages." arXiv:2305.15425. Quantifies the disparity in fertility (tokens produced per word) across languages and its impact on effective context length and cost.
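
As a companion to the Sennrich et al. entry above, the following is a deliberately minimal sketch of the core BPE training loop, in the spirit of the reference code in that paper: split words into characters, repeatedly count adjacent symbol pairs weighted by word frequency, and merge the most frequent pair. Production tokenizers add byte-level fallback, pre-tokenization rules, and heavy optimization; the toy vocabulary below is illustrative.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across a {space-separated word: frequency} dict."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):  # learn 10 merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print('merged', best)
```

The learned merge list, applied in order, is what a trained BPE tokenizer actually stores.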

Context Window and Positional Encoding

  • Su, J. et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. Introduces RoPE, now the dominant positional encoding for LLMs; a minimal sketch of the rotation appears after this list.

  • Press, O., Smith, N., & Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization." ICLR 2022. (ALiBi.) An alternative positional encoding using linear biases.

  • Dao, T. et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. The algorithm that made long-context training practical by reducing attention memory from $O(L^2)$ to $O(L)$.

  • Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." arXiv:2307.08691. Further optimizations for modern GPU architectures.

  • Liu, H. et al. (2023). "Ring Attention with Blockwise Transformers for Near-Infinite Context." arXiv:2310.01889. Enables million-token context windows by distributing attention across GPUs.
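
To make the RoPE entry above concrete: rotary embeddings rotate each two-dimensional slice of the query and key vectors by an angle proportional to the token position, so that the attention dot product depends only on relative offsets. A minimal NumPy sketch (the base of 10000 follows the RoFormer paper; the function name and shapes are illustrative):

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq_len, d), d even.
    Each (even, odd) dimension pair is rotated by angle pos * base**(-2i/d)."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    theta = pos * inv_freq                         # (seq_len, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Example: rotate the queries of an 8-token sequence with head dimension 64.
rng = np.random.default_rng(0)
q_rot = rope(rng.normal(size=(8, 64)))
```

Because the rotation angle grows linearly with position, the same function applied to queries and keys yields attention scores that depend only on the distance between tokens, which is the property long-context extensions of RoPE build on.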

Mixture of Experts

  • Jacobs, R. et al. (1991). "Adaptive Mixtures of Local Experts." Neural Computation. The original MoE paper. Historical context for understanding the modern MoE renaissance.

  • Shazeer, N. et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. Adapted MoE for deep learning with sparsely-gated routing; a minimal top-k routing sketch follows this list. The direct ancestor of modern MoE LLMs.

  • Fedus, W., Zoph, B., & Shazeer, N. (2022). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022. Simplified MoE with top-1 routing and demonstrated scaling to 1.6 trillion parameters.

  • Lepikhin, D. et al. (2021). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." ICLR 2021. Scaled MoE to 600B parameters for machine translation.

  • Clark, A. et al. (2022). "Unified Scaling Laws for Routed Language Models." ICML 2022. Develops scaling laws specific to MoE models, relating total parameters, active parameters, and number of experts.
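
To illustrate the sparsely-gated routing described in the Shazeer et al. and Switch Transformer entries, here is a minimal top-k router: a linear gate scores every expert for each token, only the k highest-scoring experts are evaluated, and their outputs are mixed with renormalized gate weights. Load-balancing losses, capacity factors, and expert parallelism are all omitted, and every name below is illustrative.

```python
import numpy as np

def top_k_moe(x, gate_w, experts, k=2):
    """Minimal sparsely-gated MoE layer over a batch of token vectors.
    x: (n_tokens, d_model); gate_w: (d_model, n_experts);
    experts: list of callables mapping a (d_model,) vector to a (d_model,) vector."""
    logits = x @ gate_w                            # (n_tokens, n_experts) gate scores
    top = np.argsort(logits, axis=-1)[:, -k:]      # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax over the selected experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * experts[e](x[t])         # only k experts run for this token
    return out

# Toy usage: 4 random linear experts, top-2 routing over 8 tokens.
rng = np.random.default_rng(0)
d_model, n_experts = 16, 4
experts = [(lambda W: (lambda v: v @ W))(rng.normal(size=(d_model, d_model)) / d_model**0.5)
           for _ in range(n_experts)]
gate_w = rng.normal(size=(d_model, n_experts))
tokens = rng.normal(size=(8, d_model))
mixed = top_k_moe(tokens, gate_w, experts, k=2)
```

The key property is that total parameters scale with the number of experts while per-token compute scales only with k, which is the trade-off the Clark et al. scaling laws quantify.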

General Overviews and Surveys

  • Sutton, R. (2019). "The Bitter Lesson." Blog post. The influential essay arguing that general methods which leverage computation ultimately prove more effective than approaches built on human-encoded knowledge. Essential philosophical context for understanding the scaling paradigm.

  • Zhao, W.X. et al. (2023). "A Survey of Large Language Models." arXiv:2303.18223. A comprehensive survey covering architecture, training, adaptation, and evaluation of LLMs. Regularly updated.

  • Naveed, H. et al. (2024). "A Comprehensive Overview of Large Language Models." arXiv:2307.06435. Another broad survey with detailed coverage of model families, training techniques, and applications.

  • Minaee, S. et al. (2024). "Large Language Models: A Survey." arXiv:2402.06196. Covers the full LLM lifecycle from data preparation through deployment, with extensive tables of model characteristics.