Chapter 19: Further Reading

Foundational Papers

  • Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017. The original Transformer paper. Introduced an encoder-decoder architecture built entirely from attention, establishing the blueprint for modern large language models; a minimal sketch of its scaled dot-product attention follows this list. https://arxiv.org/abs/1706.03762

  • He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016. Introduced residual connections that enable training very deep networks, a critical component of the Transformer. https://arxiv.org/abs/1512.03385

  • Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). "Layer Normalization." arXiv preprint arXiv:1607.06450. Proposed layer normalization as an alternative to batch normalization for sequence models. https://arxiv.org/abs/1607.06450
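
To keep the first entry concrete, here is a minimal sketch of the scaled dot-product attention at the heart of Vaswani et al. (2017); the function name, shapes, and toy tensors are illustrative, not a production implementation.

    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    import math
    import torch

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (batch, seq_len, d_k); mask broadcasts to (batch, seq_len, seq_len)
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, seq, seq)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                # rows sum to 1
        return weights @ v                                     # (batch, seq, d_k)

    q = k = v = torch.randn(2, 5, 64)
    print(scaled_dot_product_attention(q, k, v).shape)         # torch.Size([2, 5, 64])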

Architecture Variants and Analysis

  • Xiong, R., Yang, Y., He, D., et al. (2020). "On Layer Normalization in the Transformer Architecture." ICML 2020. Analyzed pre-norm vs. post-norm placement, showing that pre-norm is more stable for training deep Transformers; the two block orderings are sketched after this list. https://arxiv.org/abs/2002.04745

  • Press, O. and Wolf, L. (2017). "Using the Output Embedding to Improve Language Models." EACL 2017. Demonstrated that weight tying between the input embedding and output projection reduces parameters and improves performance; a one-line sketch follows this list. https://arxiv.org/abs/1608.05859

  • Geva, M., Schuster, R., Berant, J., and Levy, O. (2021). "Transformer Feed-Forward Layers Are Key-Value Memories." EMNLP 2021. Showed that FFN layers act as key-value memories, with rows of $W_1$ as keys and columns of $W_2$ as values; the expansion is written out after this list. https://arxiv.org/abs/2012.14913

  • Elhage, N., Nanda, N., Olsson, C., et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic. Introduced the residual stream perspective and characterized attention head behaviors as circuits. https://transformer-circuits.pub/2021/framework/
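
To make the pre-norm vs. post-norm distinction from Xiong et al. (2020) concrete, the following sketch shows the two residual-block orderings; sublayer stands for either the attention or the FFN sublayer, and the class names are illustrative.

    # Post-norm (original Transformer) vs. pre-norm residual blocks.
    import torch
    import torch.nn as nn

    class PostNormBlock(nn.Module):
        def __init__(self, d_model, sublayer):
            super().__init__()
            self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

        def forward(self, x):
            # Normalize *after* adding the residual (original Transformer).
            return self.norm(x + self.sublayer(x))

    class PreNormBlock(nn.Module):
        def __init__(self, d_model, sublayer):
            super().__init__()
            self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

        def forward(self, x):
            # Normalize only the sublayer input; the residual path stays
            # untouched, which is what makes deep stacks easier to train.
            return x + self.sublayer(self.norm(x))

    ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    x = torch.randn(2, 5, 512)
    print(PreNormBlock(512, ffn)(x).shape)   # torch.Size([2, 5, 512])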
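
The weight tying of Press and Wolf (2017) amounts to a single assignment in a typical PyTorch model; the variable names below are illustrative.

    # Tie the output projection to the input embedding matrix.
    import torch.nn as nn

    vocab_size, d_model = 50_000, 512
    tok_emb = nn.Embedding(vocab_size, d_model)            # weight: (vocab, d_model)
    lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight: (vocab, d_model)
    lm_head.weight = tok_emb.weight                        # share one parameter tensor

    # Both modules now reference the same tensor, saving vocab_size * d_model
    # parameters and coupling their gradients.
    assert lm_head.weight.data_ptr() == tok_emb.weight.data_ptr()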
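
Written out, the claim of Geva et al. is the column expansion of the FFN: with $x$ a column vector, $W_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$ and $W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$, we have $\mathrm{FFN}(x) = W_2\, f(W_1 x) = \sum_{i=1}^{d_{ff}} f(k_i^\top x)\, v_i$, where $k_i^\top$ is the $i$-th row of $W_1$ (a key matched against the current hidden state) and $v_i$ is the $i$-th column of $W_2$ (the value written back to the residual stream).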

Scaling and Efficiency

  • Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv. Established power-law relationships between model size, data, compute, and performance; the functional form is recalled after this list. https://arxiv.org/abs/2001.08361

  • Dao, T., Fu, D. Y., Ermon, S., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. An IO-aware exact attention implementation that reduces attention memory from $O(n^2)$ to $O(n)$ in sequence length; a PyTorch usage sketch follows this list. https://arxiv.org/abs/2205.14135

  • Shoeybi, M., Patwary, M., Puri, R., et al. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv. Techniques for training Transformers across multiple GPUs using tensor parallelism. https://arxiv.org/abs/1909.08053
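
As a reminder of the functional form reported by Kaplan et al., the test loss is fit by power laws such as $L(N) \approx (N_c / N)^{\alpha_N}$ in parameter count $N$ and $L(D) \approx (D_c / D)^{\alpha_D}$ in dataset size $D$; the constants $N_c$, $D_c$, $\alpha_N$, and $\alpha_D$ are empirical fits reported in the paper and are omitted here.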
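
For practical use, PyTorch 2.x exposes a fused attention function that can dispatch to a FlashAttention-style kernel when the hardware, dtype, and shapes allow it; the snippet below is a minimal sketch, and which backend actually runs depends on your PyTorch build and GPU.

    # Fused exact attention in PyTorch 2.x; may use a FlashAttention-style
    # kernel internally, falling back to the standard math implementation.
    import torch
    import torch.nn.functional as F

    batch, heads, seq, d_head = 2, 8, 1024, 64
    q = torch.randn(batch, heads, seq, d_head)
    k = torch.randn(batch, heads, seq, d_head)
    v = torch.randn(batch, heads, seq, d_head)

    # is_causal=True applies a causal mask without materializing it explicitly.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)   # torch.Size([2, 8, 1024, 64])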

Positional Encoding

  • Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). "Self-Attention with Relative Position Representations." NAACL 2018. Introduced learned relative position representations in self-attention, improving generalization over absolute positional encodings. https://arxiv.org/abs/1803.02155

  • Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv. Introduced rotary position embeddings (RoPE), now standard in Llama and many other modern LLMs; a short sketch follows this list. https://arxiv.org/abs/2104.09864

  • Press, O., Smith, N. A., and Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR 2022. ALiBi adds head-specific linear distance penalties to attention scores, enabling extrapolation to sequences longer than those seen in training; a sketch of the bias follows this list. https://arxiv.org/abs/2108.12409
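
The rotary embedding of Su et al. can be sketched in a few lines; the version below uses the "rotate-half" dimension pairing common in open-source implementations (rather than the interleaved pairing of the original paper), and the function name and base are illustrative.

    # Rotary position embedding (RoPE) applied to a (batch, seq, d_head) tensor.
    import torch

    def rope(x, base=10_000.0):
        batch, seq_len, d = x.shape                   # d must be even
        half = d // 2
        inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
        angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq, half)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., :half], x[..., half:]
        # Rotate each (x1, x2) coordinate pair by a position-dependent angle.
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    q = torch.randn(2, 16, 64)
    print(rope(q).shape)   # torch.Size([2, 16, 64])

Applied to both queries and keys before the dot product, the rotation makes the attention score between two positions depend only on their relative offset.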
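
ALiBi itself is just an additive bias, so the following sketch builds the (heads, seq, seq) bias tensor directly; the slope schedule assumes a power-of-two head count, the simplest case described in the paper.

    # ALiBi: head-specific linear distance penalties added to attention scores.
    import torch

    def alibi_bias(num_heads, seq_len):
        # Geometric slope sequence 2^(-8/H), 2^(-16/H), ... for H heads.
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
        pos = torch.arange(seq_len)
        # Element (i, j) is -(i - j) for keys at or before the query, 0 otherwise.
        distance = (pos[None, :] - pos[:, None]).clamp(max=0)
        return slopes[:, None, None] * distance[None, :, :]    # (heads, seq, seq)

    bias = alibi_bias(num_heads=8, seq_len=5)
    # Added to the pre-softmax scores alongside the causal mask:
    #   scores = q @ k.transpose(-2, -1) / math.sqrt(d_head) + bias
    print(bias.shape)   # torch.Size([8, 5, 5])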

Textbooks and Tutorials

  • Jurafsky, D. and Martin, J. H. (2024). Speech and Language Processing, 3rd ed. (Draft). Chapters 9-10 cover the Transformer architecture with excellent pedagogical exposition. https://web.stanford.edu/~jurafsky/slp3/

  • Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2023). Dive into Deep Learning. Chapter 11 provides complete, interactive implementations of the Transformer. https://d2l.ai/

  • Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning Publications. A hands-on guide to building GPT-style models with detailed architecture explanations.

Online Resources

  • The Illustrated Transformer by Jay Alammar: Visual walkthrough of every Transformer component. https://jalammar.github.io/illustrated-transformer/

  • The Annotated Transformer by Harvard NLP: Line-by-line implementation with commentary. https://nlp.seas.harvard.edu/annotated-transformer/

  • Andrej Karpathy's nanoGPT: A minimal, educational implementation of GPT training; the model definition and training loop are each only a few hundred lines of PyTorch. https://github.com/karpathy/nanoGPT

  • PyTorch Transformer tutorial: Official tutorial for using nn.Transformer and related modules. https://pytorch.org/tutorials/beginner/transformer_tutorial.html

Software

  • PyTorch (torch.nn.Transformer): Built-in Transformer implementation with nn.MultiheadAttention, nn.TransformerEncoder, and nn.TransformerDecoder; a minimal usage sketch follows this list.

  • HuggingFace Transformers: Thousands of pre-trained Transformer models with a unified API; a minimal pipeline example follows this list. https://huggingface.co/transformers/

  • xformers: Memory-efficient Transformer implementations from Meta, including efficient attention kernels. https://github.com/facebookresearch/xformers
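
A minimal usage sketch of the built-in PyTorch modules mentioned above; the hyperparameters are illustrative rather than recommended settings.

    # Stack of encoder layers using PyTorch's built-in Transformer modules.
    import torch
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(
        d_model=512, nhead=8, dim_feedforward=2048,
        batch_first=True,    # inputs shaped (batch, seq, d_model)
        norm_first=True,     # pre-norm block ordering
    )
    encoder = nn.TransformerEncoder(layer, num_layers=6)

    x = torch.randn(32, 128, 512)        # (batch, seq, d_model)
    print(encoder(x).shape)              # torch.Size([32, 128, 512])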
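
And a minimal example of the HuggingFace pipeline API; the "gpt2" checkpoint is just an example and is downloaded on first use.

    # Text generation with a pre-trained checkpoint via the unified pipeline API.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    result = generator("The Transformer architecture", max_new_tokens=20)
    print(result[0]["generated_text"])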