Chapter 21: Further Reading
Foundational Papers
- Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI. GPT-1: demonstrated that unsupervised pre-training with a Transformer decoder followed by discriminative fine-tuning achieves strong results across NLP tasks. https://openai.com/research/language-unsupervised
- Radford, A., Wu, J., Child, R., et al. (2019). "Language Models Are Unsupervised Multitask Learners." OpenAI. GPT-2: scaled up GPT-1 and introduced zero-shot task transfer, showing that a sufficiently large language model can perform tasks without fine-tuning. https://openai.com/research/better-language-models
- Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models Are Few-Shot Learners." NeurIPS 2020. GPT-3: a 175B-parameter model demonstrating in-context learning, i.e. performing tasks from a few examples in the prompt without gradient updates (see the prompt sketch after this list). https://arxiv.org/abs/2005.14165
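In-context learning, as described in the GPT-3 paper, amounts to packing worked examples into the prompt and letting the model complete the next one. A minimal sketch of the prompt format, using a hypothetical sentiment task; `complete()` is a placeholder for any autoregressive sampling routine, not a real API:

```python
# A few-shot prompt in the style of the GPT-3 paper: labeled examples are
# concatenated into the context, and the model completes the final query
# without any gradient updates. `complete` is a hypothetical stand-in for
# whatever routine generates a continuation from a prompt string.

def build_few_shot_prompt(examples, query):
    """Format (input, label) pairs plus an unlabeled query as one prompt."""
    lines = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

examples = [
    ("A touching, beautifully shot film.", "positive"),
    ("Two hours of my life I will never get back.", "negative"),
]
prompt = build_few_shot_prompt(examples, "The plot was thin but the acting saved it.")
# continuation = complete(prompt)   # the model would predict e.g. " positive"
print(prompt)
```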
Architecture and Training
- Hendrycks, D. and Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv. Introduced the GELU activation function used in GPT and most modern Transformers; a minimal implementation appears in the sketch after this list. https://arxiv.org/abs/1606.08415
- Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv. Introduced sparse attention patterns to extend context windows beyond standard quadratic attention. https://arxiv.org/abs/1904.10509
- Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." NeurIPS 2019. Proposed RMSNorm, a simplified alternative to LayerNorm used in Llama and other modern LLMs; see the sketch after this list. https://arxiv.org/abs/1910.07467
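Both GELU and RMSNorm are short enough to write out in full. A minimal PyTorch sketch of each, following the formulas in the papers above; the tanh form of GELU is the approximation used in the original GPT-2 code:

```python
import torch
import torch.nn as nn

def gelu(x: torch.Tensor) -> torch.Tensor:
    """Exact GELU from Hendrycks & Gimpel: x * Phi(x)."""
    return 0.5 * x * (1.0 + torch.erf(x / 2.0 ** 0.5))

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    """Tanh approximation of GELU used in the original GPT-2 code."""
    return 0.5 * x * (1.0 + torch.tanh(
        (2.0 / torch.pi) ** 0.5 * (x + 0.044715 * x ** 3)))

class RMSNorm(nn.Module):
    """RMSNorm (Zhang & Sennrich): rescale by the root mean square; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(2, 8, 64)
print(gelu(x).shape, RMSNorm(64)(x).shape)  # both torch.Size([2, 8, 64])
```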
Text Generation and Decoding
- Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020. Introduced nucleus (top-p) sampling and analyzed why likelihood-based decoding produces degenerate text; see the sampling sketch after this list. https://arxiv.org/abs/1904.09751
- Fan, A., Lewis, M., and Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. Introduced top-k sampling for neural text generation. https://arxiv.org/abs/1805.04833
- Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., and Socher, R. (2019). "CTRL: A Conditional Transformer Language Model for Controllable Generation." arXiv. Controllable generation using control codes to guide style and content. https://arxiv.org/abs/1909.05858
- Li, X. L., Holtzman, A., Fried, D., et al. (2023). "Contrastive Decoding: Open-ended Text Generation as Optimization." ACL 2023. Uses the difference between large and small model logits to improve generation quality. https://arxiv.org/abs/2210.15097
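Top-k and nucleus (top-p) sampling both reduce to filtering the next-token distribution before sampling from it. A minimal PyTorch sketch, assuming `logits` is the unnormalized score vector for the next token:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    """Filter a 1-D next-token logit vector with top-k and/or top-p, then sample."""
    logits = logits / temperature

    if top_k > 0:
        # Keep only the k highest-scoring tokens (Fan et al., 2018).
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p < 1.0:
        # Keep the smallest set of top tokens whose cumulative probability
        # exceeds p (Holtzman et al., 2020).
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()  # shift so the boundary token is kept
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(50257)  # e.g. GPT-2 vocabulary size
print(sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95))
```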
Efficient Inference
- Pope, R., Douglas, S., Chowdhery, A., et al. (2023). "Efficiently Scaling Transformer Inference." MLSys 2023. Analysis of memory-bound autoregressive generation and strategies for efficient deployment. https://arxiv.org/abs/2211.05102
- Leviathan, Y., Kalman, M., and Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. Uses a small draft model to propose tokens verified by the large model, achieving lossless speedup; a simplified sketch follows this list. https://arxiv.org/abs/2211.17192
- Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. (2023). "Efficient Streaming Language Models with Attention Sinks." ICLR 2024. Discovered that early tokens serve as "attention sinks" and proposed StreamingLLM for infinite-length generation. https://arxiv.org/abs/2309.17453
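Speculative decoding is easiest to see with a simplified greedy verification rule: the draft model proposes a few tokens, the target model scores the whole drafted sequence in one forward pass, and the longest agreeing prefix is kept. The sketch below uses that simplification rather than the paper's rejection-sampling rule (which preserves the target model's sampling distribution exactly); both models are assumed to map a token sequence to per-position next-token logits:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One greedy speculative-decoding step.

    `tokens` is a 1-D LongTensor of the sequence so far. Both models are
    assumed to return logits of shape (seq_len, vocab) for an input sequence,
    i.e. a next-token distribution at every position.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft = tokens.clone()
    for _ in range(k):
        next_tok = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])

    # 2. Target model scores the entire drafted sequence in ONE forward pass.
    target_logits = target_model(draft)                        # (len(draft), vocab)
    n = tokens.shape[0]
    target_preds = target_logits[n - 1 : -1].argmax(dim=-1)    # target's choice at each drafted position
    proposed = draft[n:]

    # 3. Accept the longest prefix where draft and target agree, then append
    #    one extra token from the target model (its correction at the first
    #    disagreement, or a bonus token if the whole draft matched).
    agree = (target_preds == proposed).long()
    num_accepted = int(agree.cumprod(dim=0).sum())
    accepted = proposed[:num_accepted]
    bonus = target_logits[n - 1 + num_accepted].argmax().view(1)
    return torch.cat([tokens, accepted, bonus])
```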
Modern Decoder-Only Architectures
- Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv. Open-weight LLM family combining RoPE, RMSNorm, and the SwiGLU feed-forward block (sketched after this list); later Llama generations added grouped-query attention (GQA). https://arxiv.org/abs/2302.13971
- Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). "Mistral 7B." arXiv. Efficient 7B model with sliding-window attention and grouped-query attention. https://arxiv.org/abs/2310.06825
- Anil, R., Dai, A. M., Firat, O., et al. (2023). "Gemini: A Family of Highly Capable Multimodal Models." arXiv. Google's multimodal model family built on a decoder-only Transformer architecture. https://arxiv.org/abs/2312.11805
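Of the architectural ingredients named above, the SwiGLU feed-forward block is compact enough to show directly. A minimal PyTorch sketch following the gated formulation used in LLaMA; the hidden width (roughly 8/3 of the model dimension in practice, rounded for hardware efficiency) is illustrative here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN as used in LLaMA: w2( SiLU(w1 x) * w3 x ), with no biases."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)
print(SwiGLUFeedForward(512, 1408)(x).shape)  # torch.Size([2, 16, 512]); 1408 is ~8/3 * 512, rounded
```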
Scaling Laws
- Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv. Power-law relationships between model size, data, compute, and performance. https://arxiv.org/abs/2001.08361
- Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022. The Chinchilla paper: showed that most LLMs were undertrained relative to their size and proposed optimal data/parameter ratios; see the back-of-the-envelope sketch after this list. https://arxiv.org/abs/2203.15556
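The Chinchilla recipe reduces to a little arithmetic once training compute is approximated as C ≈ 6·N·D FLOPs (N parameters, D training tokens) and the compute-optimal token-to-parameter ratio is taken as roughly 20:1, the rule of thumb commonly drawn from Hoffmann et al. A back-of-the-envelope sketch under those assumptions:

```python
# Back-of-the-envelope Chinchilla-style sizing.
# Training compute is approximated as C ~= 6 * N * D FLOPs, and the
# compute-optimal ratio D/N ~= 20 is the commonly quoted rule of thumb,
# not an exact constant.

def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a FLOP budget."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: ~70B parameters on ~1.4T tokens,
# i.e. roughly 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs.
n, d = compute_optimal(5.9e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")  # ~70B params, ~1.4T tokens
```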
Mechanistic Interpretability
- Olsson, C., Elhage, N., Nanda, N., et al. (2022). "In-context Learning and Induction Heads." Anthropic. Identified induction heads as a key mechanism for in-context learning in Transformer language models. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/
- Elhage, N., Nanda, N., Olsson, C., et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic. Residual stream perspective and circuit-level analysis of Transformer behavior. https://transformer-circuits.pub/2021/framework/
Textbooks and Tutorials
- Jurafsky, D. and Martin, J. H. (2024). Speech and Language Processing, 3rd ed. (Draft). Chapter 10 covers autoregressive language models and the GPT architecture. https://web.stanford.edu/~jurafsky/slp3/
- Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning Publications. Step-by-step GPT implementation with training and fine-tuning.
- Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2023). Dive into Deep Learning. Chapter 11 includes GPT-style language model implementations. https://d2l.ai/
Online Resources
- Andrej Karpathy's nanoGPT: Minimal, educational GPT implementation in ~300 lines of PyTorch. https://github.com/karpathy/nanoGPT
- Andrej Karpathy's "Let's build GPT": Step-by-step video building GPT from scratch. https://www.youtube.com/watch?v=kCc8FmEb1nY
- The Illustrated GPT-2 by Jay Alammar: Visual walkthrough of GPT-2's architecture and generation process. https://jalammar.github.io/illustrated-gpt2/
- HuggingFace Text Generation Tutorial: Guide to using the generate() API with various decoding strategies; a minimal usage sketch follows this list. https://huggingface.co/docs/transformers/generation_strategies
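A minimal usage sketch of the generate() API referenced above, combining the sampling-based decoding options covered earlier in this chapter; the model name and hyperparameters are illustrative, and the linked guide documents the full set of options:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The key idea behind nucleus sampling is", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,        # sample instead of greedy/beam decoding
    temperature=0.8,       # soften or sharpen the next-token distribution
    top_k=50,              # keep only the 50 most likely tokens
    top_p=0.95,            # nucleus sampling (Holtzman et al., 2020)
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```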
Software
- HuggingFace Transformers: Pre-trained GPT-2 models with the generate() API for text generation. https://huggingface.co/transformers/
- nanoGPT by Andrej Karpathy: Training GPT-2 from scratch or fine-tuning on custom data. https://github.com/karpathy/nanoGPT
- vLLM: High-throughput LLM serving engine with PagedAttention for efficient KV cache management; a minimal offline-inference sketch follows this list. https://github.com/vllm-project/vllm
- llama.cpp: Efficient CPU inference for decoder-only models with quantization support. https://github.com/ggerganov/llama.cpp
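A minimal vLLM offline-inference sketch in the style of its quickstart; the model name and sampling parameters are illustrative, and argument names may shift between vLLM versions:

```python
from vllm import LLM, SamplingParams

# PagedAttention manages the KV cache in fixed-size blocks under the hood,
# which is what enables high-throughput batched generation.
llm = LLM(model="gpt2")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(
    ["Speculative decoding works by", "The KV cache stores"],
    params,
)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```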