Chapter 19: Further Reading

Foundational Papers

  • Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017. The original Transformer paper. Introduced an encoder-decoder architecture built entirely from attention, establishing the blueprint for modern large language models; a minimal sketch of its scaled dot-product attention follows this list. https://arxiv.org/abs/1706.03762

  • He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016. Introduced residual connections that enable training very deep networks, a critical component of the Transformer. https://arxiv.org/abs/1512.03385

  • Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). "Layer Normalization." arXiv preprint arXiv:1607.06450. Proposed layer normalization as an alternative to batch normalization for sequence models. https://arxiv.org/abs/1607.06450
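
To keep the first entry concrete, here is a minimal sketch of the scaled dot-product attention at the heart of Vaswani et al. (2017); the function name, shapes, and toy tensors are illustrative, not a production implementation.

    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    import math
    import torch

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (batch, seq_len, d_k); mask broadcasts to (batch, seq_len, seq_len)
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, seq, seq)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                # rows sum to 1
        return weights @ v                                     # (batch, seq, d_k)

    q = k = v = torch.randn(2, 5, 64)
    print(scaled_dot_product_attention(q, k, v).shape)         # torch.Size([2, 5, 64])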

Architecture Variants and Analysis

  • Xiong, R., Yang, Y., He, D., et al. (2020). "On Layer Normalization in the Transformer Architecture." ICML 2020. Analyzed pre-norm vs. post-norm placement, showing that pre-norm is more stable for training deep Transformers; the two block orderings are sketched after this list. https://arxiv.org/abs/2002.04745

  • Press, O. and Wolf, L. (2017). "Using the Output Embedding to Improve Language Models." EACL 2017. Demonstrated that weight tying between the input embedding and output projection reduces parameters and improves performance; a one-line sketch follows this list. https://arxiv.org/abs/1608.05859

  • Geva, M., Schuster, R., Berant, J., and Levy, O. (2021). "Transformer Feed-Forward Layers Are Key-Value Memories." EMNLP 2021. Showed that FFN layers act as key-value memories, with rows of $W_1$ as keys and columns of $W_2$ as values; the expansion is written out after this list. https://arxiv.org/abs/2012.14913

  • Elhage, N., Nanda, N., Olsson, C., et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic. Introduced the residual stream perspective and characterized attention head behaviors as circuits. https://transformer-circuits.pub/2021/framework/
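
To make the pre-norm vs. post-norm distinction from Xiong et al. (2020) concrete, the following sketch shows the two residual-block orderings; sublayer stands for either the attention or the FFN sublayer, and the class names are illustrative.

    # Post-norm (original Transformer) vs. pre-norm residual blocks.
    import torch
    import torch.nn as nn

    class PostNormBlock(nn.Module):
        def __init__(self, d_model, sublayer):
            super().__init__()
            self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

        def forward(self, x):
            # Normalize *after* adding the residual (original Transformer).
            return self.norm(x + self.sublayer(x))

    class PreNormBlock(nn.Module):
        def __init__(self, d_model, sublayer):
            super().__init__()
            self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

        def forward(self, x):
            # Normalize only the sublayer input; the residual path stays
            # untouched, which is what makes deep stacks easier to train.
            return x + self.sublayer(self.norm(x))

    ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    x = torch.randn(2, 5, 512)
    print(PreNormBlock(512, ffn)(x).shape)   # torch.Size([2, 5, 512])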
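
The weight tying of Press and Wolf (2017) amounts to a single assignment in a typical PyTorch model; the variable names below are illustrative.

    # Tie the output projection to the input embedding matrix.
    import torch.nn as nn

    vocab_size, d_model = 50_000, 512
    tok_emb = nn.Embedding(vocab_size, d_model)            # weight: (vocab, d_model)
    lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight: (vocab, d_model)
    lm_head.weight = tok_emb.weight                        # share one parameter tensor

    # Both modules now reference the same tensor, saving vocab_size * d_model
    # parameters and coupling their gradients.
    assert lm_head.weight.data_ptr() == tok_emb.weight.data_ptr()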
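
Written out, the claim of Geva et al. is the column expansion of the FFN: with $x$ a column vector, $W_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$ and $W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$, we have $\mathrm{FFN}(x) = W_2\, f(W_1 x) = \sum_{i=1}^{d_{ff}} f(k_i^\top x)\, v_i$, where $k_i^\top$ is the $i$-th row of $W_1$ (a key matched against the current hidden state) and $v_i$ is the $i$-th column of $W_2$ (the value written back to the residual stream).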

Scaling and Efficiency

  • Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv. Established power-law relationships between model size, data, compute, and performance; the functional form is recalled after this list. https://arxiv.org/abs/2001.08361

  • Dao, T., Fu, D. Y., Ermon, S., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. An IO-aware exact attention implementation that reduces attention memory from $O(n^2)$ to $O(n)$ in sequence length; a PyTorch usage sketch follows this list. https://arxiv.org/abs/2205.14135

  • Shoeybi, M., Patwary, M., Puri, R., et al. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv. Techniques for training Transformers across multiple GPUs using tensor parallelism. https://arxiv.org/abs/1909.08053
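
As a reminder of the functional form reported by Kaplan et al., the test loss is fit by power laws such as $L(N) \approx (N_c / N)^{\alpha_N}$ in parameter count $N$ and $L(D) \approx (D_c / D)^{\alpha_D}$ in dataset size $D$; the constants $N_c$, $D_c$, $\alpha_N$, and $\alpha_D$ are empirical fits reported in the paper and are omitted here.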
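
For practical use, PyTorch 2.x exposes a fused attention function that can dispatch to a FlashAttention-style kernel when the hardware, dtype, and shapes allow it; the snippet below is a minimal sketch, and which backend actually runs depends on your PyTorch build and GPU.

    # Fused exact attention in PyTorch 2.x; may use a FlashAttention-style
    # kernel internally, falling back to the standard math implementation.
    import torch
    import torch.nn.functional as F

    batch, heads, seq, d_head = 2, 8, 1024, 64
    q = torch.randn(batch, heads, seq, d_head)
    k = torch.randn(batch, heads, seq, d_head)
    v = torch.randn(batch, heads, seq, d_head)

    # is_causal=True applies a causal mask without materializing it explicitly.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)   # torch.Size([2, 8, 1024, 64])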

Positional Encoding

  • Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). "Self-Attention with Relative Position Representations." NAACL 2018. Introduced learned relative position representations in self-attention, improving generalization over absolute positional encodings. https://arxiv.org/abs/1803.02155

  • Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv. Introduced rotary position embeddings (RoPE), now standard in Llama and many other modern LLMs; a short sketch follows this list. https://arxiv.org/abs/2104.09864

  • Press, O., Smith, N. A., and Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR 2022. ALiBi adds head-specific linear distance penalties to attention scores, enabling extrapolation to sequences longer than those seen in training; a sketch of the bias follows this list. https://arxiv.org/abs/2108.12409
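
The rotary embedding of Su et al. can be sketched in a few lines; the version below uses the "rotate-half" dimension pairing common in open-source implementations (rather than the interleaved pairing of the original paper), and the function name and base are illustrative.

    # Rotary position embedding (RoPE) applied to a (batch, seq, d_head) tensor.
    import torch

    def rope(x, base=10_000.0):
        batch, seq_len, d = x.shape                   # d must be even
        half = d // 2
        inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
        angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq, half)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., :half], x[..., half:]
        # Rotate each (x1, x2) coordinate pair by a position-dependent angle.
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    q = torch.randn(2, 16, 64)
    print(rope(q).shape)   # torch.Size([2, 16, 64])

Applied to both queries and keys before the dot product, the rotation makes the attention score between two positions depend only on their relative offset.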
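
ALiBi itself is just an additive bias, so the following sketch builds the (heads, seq, seq) bias tensor directly; the slope schedule assumes a power-of-two head count, the simplest case described in the paper.

    # ALiBi: head-specific linear distance penalties added to attention scores.
    import torch

    def alibi_bias(num_heads, seq_len):
        # Geometric slope sequence 2^(-8/H), 2^(-16/H), ... for H heads.
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
        pos = torch.arange(seq_len)
        # Element (i, j) is -(i - j) for keys at or before the query, 0 otherwise.
        distance = (pos[None, :] - pos[:, None]).clamp(max=0)
        return slopes[:, None, None] * distance[None, :, :]    # (heads, seq, seq)

    bias = alibi_bias(num_heads=8, seq_len=5)
    # Added to the pre-softmax scores alongside the causal mask:
    #   scores = q @ k.transpose(-2, -1) / math.sqrt(d_head) + bias
    print(bias.shape)   # torch.Size([8, 5, 5])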

Textbooks and Tutorials

  • Jurafsky, D. and Martin, J. H. (2024). Speech and Language Processing, 3rd ed. (Draft). Chapters 9-10 cover the Transformer architecture with excellent pedagogical exposition. https://web.stanford.edu/~jurafsky/slp3/

  • Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2023). Dive into Deep Learning. Chapter 11 provides complete, interactive implementations of the Transformer. https://d2l.ai/

  • Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning Publications. A hands-on guide to building GPT-style models with detailed architecture explanations.

Online Resources

  • The Illustrated Transformer by Jay Alammar: Visual walkthrough of every Transformer component. https://jalammar.github.io/illustrated-transformer/

  • The Annotated Transformer by Harvard NLP: Line-by-line implementation with commentary. https://nlp.seas.harvard.edu/annotated-transformer/

  • Andrej Karpathy's nanoGPT: A minimal, educational implementation of GPT training; the model definition and training loop are each only a few hundred lines of PyTorch. https://github.com/karpathy/nanoGPT

  • PyTorch Transformer tutorial: Official tutorial for using nn.Transformer and related modules. https://pytorch.org/tutorials/beginner/transformer_tutorial.html

Software

  • PyTorch (torch.nn.Transformer): Built-in Transformer implementation with nn.MultiheadAttention, nn.TransformerEncoder, and nn.TransformerDecoder; a minimal usage sketch follows this list.

  • HuggingFace Transformers: Thousands of pre-trained Transformer models with a unified API; a minimal pipeline example follows this list. https://huggingface.co/transformers/

  • xformers: Memory-efficient Transformer implementations from Meta, including efficient attention kernels. https://github.com/facebookresearch/xformers
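
A minimal usage sketch of the built-in PyTorch modules mentioned above; the hyperparameters are illustrative rather than recommended settings.

    # Stack of encoder layers using PyTorch's built-in Transformer modules.
    import torch
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(
        d_model=512, nhead=8, dim_feedforward=2048,
        batch_first=True,    # inputs shaped (batch, seq, d_model)
        norm_first=True,     # pre-norm block ordering
    )
    encoder = nn.TransformerEncoder(layer, num_layers=6)

    x = torch.randn(32, 128, 512)        # (batch, seq, d_model)
    print(encoder(x).shape)              # torch.Size([32, 128, 512])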
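
And a minimal example of the HuggingFace pipeline API; the "gpt2" checkpoint is just an example and is downloaded on first use.

    # Text generation with a pre-trained checkpoint via the unified pipeline API.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    result = generator("The Transformer architecture", max_new_tokens=20)
    print(result[0]["generated_text"])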