Chapter 21: Further Reading
Foundational Papers
- Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI. GPT-1: demonstrated that unsupervised pre-training with a Transformer decoder followed by discriminative fine-tuning achieves strong results across NLP tasks. https://openai.com/research/language-unsupervised
- Radford, A., Wu, J., Child, R., et al. (2019). "Language Models Are Unsupervised Multitask Learners." OpenAI. GPT-2: scaled up GPT-1 and introduced zero-shot task transfer, showing that a sufficiently large language model can perform tasks without fine-tuning. https://openai.com/research/better-language-models
- Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models Are Few-Shot Learners." NeurIPS 2020. GPT-3: a 175B-parameter model demonstrating in-context learning, i.e. performing tasks from a few examples in the prompt without gradient updates (see the prompt sketch after this list). https://arxiv.org/abs/2005.14165
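In-context learning, as described in the GPT-3 paper, amounts to packing worked examples into the prompt and letting the model complete the next one. A minimal sketch of the prompt format, using a hypothetical sentiment task; `complete()` is a placeholder for any autoregressive sampling routine, not a real API:

```python
# A few-shot prompt in the style of the GPT-3 paper: labeled examples are
# concatenated into the context, and the model completes the final query
# without any gradient updates. `complete` is a hypothetical stand-in for
# whatever routine generates a continuation from a prompt string.

def build_few_shot_prompt(examples, query):
    """Format (input, label) pairs plus an unlabeled query as one prompt."""
    lines = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

examples = [
    ("A touching, beautifully shot film.", "positive"),
    ("Two hours of my life I will never get back.", "negative"),
]
prompt = build_few_shot_prompt(examples, "The plot was thin but the acting saved it.")
# continuation = complete(prompt)   # the model would predict e.g. " positive"
print(prompt)
```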
Architecture and Training
- Hendrycks, D. and Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv. Introduced the GELU activation function used in GPT and most modern Transformers; a minimal implementation appears in the sketch after this list. https://arxiv.org/abs/1606.08415
- Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv. Introduced sparse attention patterns to extend context windows beyond standard quadratic attention. https://arxiv.org/abs/1904.10509
- Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." NeurIPS 2019. Proposed RMSNorm, a simplified alternative to LayerNorm used in Llama and other modern LLMs; see the sketch after this list. https://arxiv.org/abs/1910.07467
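Both GELU and RMSNorm are short enough to write out in full. A minimal PyTorch sketch of each, following the formulas in the papers above; the tanh form of GELU is the approximation used in the original GPT-2 code:

```python
import torch
import torch.nn as nn

def gelu(x: torch.Tensor) -> torch.Tensor:
    """Exact GELU from Hendrycks & Gimpel: x * Phi(x)."""
    return 0.5 * x * (1.0 + torch.erf(x / 2.0 ** 0.5))

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    """Tanh approximation of GELU used in the original GPT-2 code."""
    return 0.5 * x * (1.0 + torch.tanh(
        (2.0 / torch.pi) ** 0.5 * (x + 0.044715 * x ** 3)))

class RMSNorm(nn.Module):
    """RMSNorm (Zhang & Sennrich): rescale by the root mean square; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(2, 8, 64)
print(gelu(x).shape, RMSNorm(64)(x).shape)  # both torch.Size([2, 8, 64])
```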
Text Generation and Decoding
- Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020. Introduced nucleus (top-p) sampling and analyzed why likelihood-based decoding produces degenerate text; see the sampling sketch after this list. https://arxiv.org/abs/1904.09751
- Fan, A., Lewis, M., and Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. Introduced top-k sampling for neural text generation. https://arxiv.org/abs/1805.04833
- Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., and Socher, R. (2019). "CTRL: A Conditional Transformer Language Model for Controllable Generation." arXiv. Controllable generation using control codes to guide style and content. https://arxiv.org/abs/1909.05858
- Li, X. L., Holtzman, A., Fried, D., et al. (2023). "Contrastive Decoding: Open-ended Text Generation as Optimization." ACL 2023. Uses the difference between large and small model logits to improve generation quality. https://arxiv.org/abs/2210.15097
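Top-k and nucleus (top-p) sampling both reduce to filtering the next-token distribution before sampling from it. A minimal PyTorch sketch, assuming `logits` is the unnormalized score vector for the next token:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    """Filter a 1-D next-token logit vector with top-k and/or top-p, then sample."""
    logits = logits / temperature

    if top_k > 0:
        # Keep only the k highest-scoring tokens (Fan et al., 2018).
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p < 1.0:
        # Keep the smallest set of top tokens whose cumulative probability
        # exceeds p (Holtzman et al., 2020).
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()  # shift so the boundary token is kept
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(50257)  # e.g. GPT-2 vocabulary size
print(sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95))
```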
Efficient Inference
- Pope, R., Douglas, S., Chowdhery, A., et al. (2023). "Efficiently Scaling Transformer Inference." MLSys 2023. Analysis of memory-bound autoregressive generation and strategies for efficient deployment. https://arxiv.org/abs/2211.05102
- Leviathan, Y., Kalman, M., and Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. Uses a small draft model to propose tokens verified by the large model, achieving lossless speedup; a simplified sketch follows this list. https://arxiv.org/abs/2211.17192
- Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. (2023). "Efficient Streaming Language Models with Attention Sinks." ICLR 2024. Discovered that early tokens serve as "attention sinks" and proposed StreamingLLM for infinite-length generation. https://arxiv.org/abs/2309.17453
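Speculative decoding is easiest to see with a simplified greedy verification rule: the draft model proposes a few tokens, the target model scores the whole drafted sequence in one forward pass, and the longest agreeing prefix is kept. The sketch below uses that simplification rather than the paper's rejection-sampling rule (which preserves the target model's sampling distribution exactly); both models are assumed to map a token sequence to per-position next-token logits:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One greedy speculative-decoding step.

    `tokens` is a 1-D LongTensor of the sequence so far. Both models are
    assumed to return logits of shape (seq_len, vocab) for an input sequence,
    i.e. a next-token distribution at every position.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft = tokens.clone()
    for _ in range(k):
        next_tok = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])

    # 2. Target model scores the entire drafted sequence in ONE forward pass.
    target_logits = target_model(draft)                        # (len(draft), vocab)
    n = tokens.shape[0]
    target_preds = target_logits[n - 1 : -1].argmax(dim=-1)    # target's choice at each drafted position
    proposed = draft[n:]

    # 3. Accept the longest prefix where draft and target agree, then append
    #    one extra token from the target model (its correction at the first
    #    disagreement, or a bonus token if the whole draft matched).
    agree = (target_preds == proposed).long()
    num_accepted = int(agree.cumprod(dim=0).sum())
    accepted = proposed[:num_accepted]
    bonus = target_logits[n - 1 + num_accepted].argmax().view(1)
    return torch.cat([tokens, accepted, bonus])
```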
Modern Decoder-Only Architectures
- Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv. Open-weight LLM family combining RoPE, RMSNorm, and the SwiGLU feed-forward block (sketched after this list); later Llama generations added grouped-query attention (GQA). https://arxiv.org/abs/2302.13971
- Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). "Mistral 7B." arXiv. Efficient 7B model with sliding-window attention and grouped-query attention. https://arxiv.org/abs/2310.06825
- Anil, R., Dai, A. M., Firat, O., et al. (2023). "Gemini: A Family of Highly Capable Multimodal Models." arXiv. Google's multimodal model family built on a decoder-only Transformer architecture. https://arxiv.org/abs/2312.11805
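Of the architectural ingredients named above, the SwiGLU feed-forward block is compact enough to show directly. A minimal PyTorch sketch following the gated formulation used in LLaMA; the hidden width (roughly 8/3 of the model dimension in practice, rounded for hardware efficiency) is illustrative here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN as used in LLaMA: w2( SiLU(w1 x) * w3 x ), with no biases."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)
print(SwiGLUFeedForward(512, 1408)(x).shape)  # torch.Size([2, 16, 512]); 1408 is ~8/3 * 512, rounded
```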
Scaling Laws
- Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv. Power-law relationships between model size, data, compute, and performance. https://arxiv.org/abs/2001.08361
- Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022. The Chinchilla paper: showed that most LLMs were undertrained relative to their size and proposed optimal data/parameter ratios; see the back-of-the-envelope sketch after this list. https://arxiv.org/abs/2203.15556
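The Chinchilla recipe reduces to a little arithmetic once training compute is approximated as C ≈ 6·N·D FLOPs (N parameters, D training tokens) and the compute-optimal token-to-parameter ratio is taken as roughly 20:1, the rule of thumb commonly drawn from Hoffmann et al. A back-of-the-envelope sketch under those assumptions:

```python
# Back-of-the-envelope Chinchilla-style sizing.
# Training compute is approximated as C ~= 6 * N * D FLOPs, and the
# compute-optimal ratio D/N ~= 20 is the commonly quoted rule of thumb,
# not an exact constant.

def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a FLOP budget."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: ~70B parameters on ~1.4T tokens,
# i.e. roughly 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs.
n, d = compute_optimal(5.9e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")  # ~70B params, ~1.4T tokens
```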
Mechanistic Interpretability
- Olsson, C., Elhage, N., Nanda, N., et al. (2022). "In-context Learning and Induction Heads." Anthropic. Identified induction heads as a key mechanism for in-context learning in Transformer language models. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/
- Elhage, N., Nanda, N., Olsson, C., et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic. Residual stream perspective and circuit-level analysis of Transformer behavior. https://transformer-circuits.pub/2021/framework/
Textbooks and Tutorials
- Jurafsky, D. and Martin, J. H. (2024). Speech and Language Processing, 3rd ed. (Draft). Chapter 10 covers autoregressive language models and the GPT architecture. https://web.stanford.edu/~jurafsky/slp3/
- Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning Publications. Step-by-step GPT implementation with training and fine-tuning.
- Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2023). Dive into Deep Learning. Chapter 11 includes GPT-style language model implementations. https://d2l.ai/
Online Resources
- Andrej Karpathy's nanoGPT: Minimal, educational GPT implementation in ~300 lines of PyTorch. https://github.com/karpathy/nanoGPT
- Andrej Karpathy's "Let's build GPT": Step-by-step video building GPT from scratch. https://www.youtube.com/watch?v=kCc8FmEb1nY
- The Illustrated GPT-2 by Jay Alammar: Visual walkthrough of GPT-2's architecture and generation process. https://jalammar.github.io/illustrated-gpt2/
- HuggingFace Text Generation Tutorial: Guide to using the generate() API with various decoding strategies; a minimal usage sketch follows this list. https://huggingface.co/docs/transformers/generation_strategies
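A minimal usage sketch of the generate() API referenced above, combining the sampling-based decoding options covered earlier in this chapter; the model name and hyperparameters are illustrative, and the linked guide documents the full set of options:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The key idea behind nucleus sampling is", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,        # sample instead of greedy/beam decoding
    temperature=0.8,       # soften or sharpen the next-token distribution
    top_k=50,              # keep only the 50 most likely tokens
    top_p=0.95,            # nucleus sampling (Holtzman et al., 2020)
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```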
Software
- HuggingFace Transformers: Pre-trained GPT-2 models with the generate() API for text generation. https://huggingface.co/transformers/
- nanoGPT by Andrej Karpathy: Training GPT-2 from scratch or fine-tuning on custom data. https://github.com/karpathy/nanoGPT
- vLLM: High-throughput LLM serving engine with PagedAttention for efficient KV cache management; a minimal offline-inference sketch follows this list. https://github.com/vllm-project/vllm
- llama.cpp: Efficient CPU inference for decoder-only models with quantization support. https://github.com/ggerganov/llama.cpp
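A minimal vLLM offline-inference sketch in the style of its quickstart; the model name and sampling parameters are illustrative, and argument names may shift between vLLM versions:

```python
from vllm import LLM, SamplingParams

# PagedAttention manages the KV cache in fixed-size blocks under the hood,
# which is what enables high-throughput batched generation.
llm = LLM(model="gpt2")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(
    ["Speculative decoding works by", "The KV cache stores"],
    params,
)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```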