Chapter 10: Further Reading

Essential Sources

1. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention Is All You Need" (NeurIPS, 2017)

The paper that introduced the transformer. Vaswani et al. demonstrated that an architecture built entirely on attention mechanisms — with no recurrence and no convolution — could achieve state-of-the-art results on machine translation (28.4 BLEU on WMT 2014 English-to-German, 41.8 BLEU on English-to-French), while being significantly more parallelizable than RNN-based models. The paper introduced scaled dot-product attention, multi-head attention, sinusoidal positional encoding, and the encoder-decoder transformer architecture that has become the foundation of modern deep learning.
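The central formula can be sketched in a few lines of NumPy. This is a minimal single-head version, without masking or batching; the function name is illustrative, not from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                               # weighted average of value rows
```

Multi-head attention runs several copies of this in parallel on learned linear projections of Q, K, and V, then concatenates and projects the results.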

Reading guidance: Begin with Section 3 (Model Architecture), which defines every component with mathematical precision. Section 3.2.1 (Scaled Dot-Product Attention) and 3.2.2 (Multi-Head Attention) are the core contributions — read these carefully and verify every equation against your own implementation. Section 4 (Why Self-Attention) is the intellectual argument for the architecture: the comparison table of complexity, sequential operations, and maximum path length is one of the most cited tables in machine learning. Section 3.5 (Positional Encoding) is brief but important — the authors note that learned positional embeddings produced "nearly identical results" to sinusoidal encodings. The training details in Section 5.3 (learning rate warmup schedule, label smoothing) are practical and widely adopted. For a follow-up that improves training stability, see Xiong et al., "On Layer Normalization in the Transformer Architecture" (ICML, 2020), which provides the theoretical justification for pre-LN.
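The sinusoidal encoding of Section 3.5 is easy to reproduce directly from its definition; a sketch (function name mine, d_model assumed even):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Each dimension pair is a sinusoid of a different wavelength, which is what lets the model express relative offsets as simple linear functions of the encoding.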

2. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS, 2022)

This paper reframed attention computation as a memory-hierarchy problem rather than a FLOP-count problem. Dao et al. showed that naive attention is memory-bound on modern GPUs — the bottleneck is transferring the $O(n^2)$ attention matrix between HBM and SRAM, not computing it. FlashAttention computes exact attention (not an approximation) by tiling the computation into SRAM-sized blocks and using the online softmax algorithm to avoid materializing the full attention matrix. The result is a 2-4x wall-clock speedup and a 5-20x memory reduction, enabling context windows of tens of thousands of tokens.

Reading guidance: Section 1 (Introduction) clearly states the memory hierarchy argument — read this first to understand why the paper exists. Section 3.1 (Standard Attention Implementation) and 3.2 (FlashAttention Algorithm) present the tiled computation with the online softmax trick. The key insight is Algorithm 1, which maintains running statistics ($m_i$, $\ell_i$, $O_i$) across blocks of K/V. If you find the algorithm dense, start by implementing the 1D online softmax (Exercise 10.12 in this chapter) to build intuition. Section 4 (Experiments) demonstrates that FlashAttention enables training GPT-2 with 4x longer context (4K tokens) at the same wall-clock time as standard attention with 1K tokens. The follow-up paper, Dao (2023), "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning," provides further optimizations and is implemented in PyTorch 2.0+ via F.scaled_dot_product_attention.
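The 1D online softmax mentioned above is small enough to sketch directly. This is the single-pass normalizer computation only; in FlashAttention the same rescaling is applied blockwise to the running output $O_i$ as well:

```python
import math

def online_softmax(xs):
    # One pass over the inputs, keeping only a running max m and a running
    # normalizer l; l is rescaled whenever a new maximum appears.
    m, l = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        l = l * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # a second pass produces the normalized probabilities
    return [math.exp(x - m) / l for x in xs]
```

Because m and l are finalized in a single pass, the softmax over a long sequence can be assembled block by block without ever holding all the scores at once — exactly the property the tiled attention algorithm relies on.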

3. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ICLR, 2021)

The Vision Transformer (ViT) demonstrated that pure transformers — with no convolutional layers — can match or exceed CNN performance on image classification when pretrained on sufficient data (JFT-300M, 303 million images). The paper challenged the prevailing assumption that convolution's inductive biases (locality, translation equivariance) were necessary for vision. The key architectural idea is simple: split the image into fixed-size patches, embed each patch linearly, add positional embeddings, and process with a standard transformer encoder.

Reading guidance: Section 3 (Method) is concise and clear — the entire ViT architecture is described in one page. The critical experimental finding is in Section 4.4 (Scaling Study): ViT underperforms CNNs when trained on small datasets (ImageNet-1K) but surpasses them when pretrained on large datasets (ImageNet-21K, JFT-300M). This confirms the transformer's trade-off: it lacks the inductive bias that helps CNNs learn from limited data, but its greater expressiveness wins when data is abundant. Figure 7 (attention distance vs. depth) shows that lower layers learn local attention patterns (similar to convolution) while upper layers learn global patterns — the transformer discovers locality from data rather than building it in. For an analysis of what ViTs learn internally, see Raghu et al., "Do Vision Transformers See Like Convolutional Neural Networks?" (NeurIPS, 2021).
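The patch-splitting step of Section 3 reduces to a reshape; a sketch for a single channels-last image (names mine, not from the ViT codebase):

```python
import numpy as np

def patchify(img, patch=16):
    # Split an (H, W, C) image into non-overlapping patch x patch tiles and
    # flatten each tile into a vector of length patch * patch * C.
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)        # group the two grid axes together
    return tiles.reshape(-1, patch * patch * C)   # (num_patches, patch_dim)
```

ViT then multiplies each row by a learned projection matrix, prepends a [class] token, and adds positional embeddings before the standard transformer encoder.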

4. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah, "A Mathematical Framework for Transformer Circuits" (Anthropic, 2021)

This paper provides a mechanistic interpretability framework for understanding transformer computation. The key insight is the "residual stream" view: the transformer is not a sequential pipeline but a broadcasting architecture where each layer reads from and writes to a shared communication channel (the residual stream). Attention heads move information between positions, while MLP layers store and retrieve factual knowledge. The paper demonstrates that small transformers can be fully reverse-engineered, with individual heads performing identifiable computations (induction heads, duplicate token heads, etc.).

Reading guidance: Start with the "Summary of Results" section, which provides the key claims without the full mathematical machinery. The residual stream formalism (Section 2) is the most important conceptual contribution — once you see the transformer as reads/writes to a shared channel, many architectural choices (pre-LN, skip connections, the role of FFN) become intuitive. Section 4 (Induction Heads) demonstrates a concrete circuit: two attention heads across two layers that implement in-context learning by completing patterns of the form [A][B] ... [A] → [B]. This paper is valuable not only for its specific findings but as a model for how to think about what transformers compute, rather than just what they predict.
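The residual-stream view fits in two lines of Python. This is a sketch of a pre-LN block with the sublayers passed in as callables — an illustration of the framework's reading, not Anthropic's code:

```python
import numpy as np

def transformer_block(x, attn, mlp, ln1, ln2):
    # Each sublayer *reads* the stream (through its own layer norm) and
    # *adds* its output back; nothing ever overwrites the stream in place.
    x = x + attn(ln1(x))   # attention heads move information between positions
    x = x + mlp(ln2(x))    # the MLP reads and writes within each position
    return x
```

A sublayer that writes nothing leaves the stream untouched, which is why features written by early layers remain readable by much later ones — the intuition behind treating the stream as a shared communication channel.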

5. Iz Beltagy, Matthew E. Peters, and Arman Cohan, "Longformer: The Long-Document Transformer" (arXiv, 2020)

The Longformer addresses the $O(n^2)$ complexity bottleneck by combining sliding-window attention (each position attends to $w$ local positions) with global attention (designated tokens attend to all positions). This reduces complexity to $O(n \cdot w)$ while maintaining the ability to propagate information across the full sequence through global tokens. The paper demonstrates strong performance on long-document tasks including classification, question answering, and coreference resolution on documents up to 4,096 tokens.

Reading guidance: Section 3 defines the three attention patterns: sliding window, dilated sliding window, and global attention. The key design decision is which tokens receive global attention — for classification, only the [CLS] token; for question answering, all question tokens attend globally. Section 4 (Experiments) compares Longformer against RoBERTa (which truncates long documents) and demonstrates that processing the full document — rather than truncating to 512 tokens — improves performance on all tasks. For alternative approaches to efficient attention, see Kitaev et al., "Reformer: The Efficient Transformer" (ICLR, 2020), which uses locality-sensitive hashing to reduce attention complexity to $O(n \log n)$, and Zaheer et al., "Big Bird: Transformers for Longer Sequences" (NeurIPS, 2020), which provides theoretical analysis of sparse attention's approximation properties.
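The combined sliding-window-plus-global pattern can be sketched as a boolean mask. Here w is the half-width of the band (so each position sees up to 2w+1 neighbors) and the function name and global-index convention are mine, not the paper's:

```python
import numpy as np

def longformer_mask(n, w, global_idx=()):
    # True where attention is allowed: a diagonal band of half-width w,
    # plus full rows and columns for globally-attending tokens.
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) <= w   # sliding-window band
    for g in global_idx:
        mask[g, :] = True   # the global token attends to every position
        mask[:, g] = True   # every position attends to the global token
    return mask
```

Each row has O(w) allowed entries plus the global columns, which is where the O(n · w) complexity comes from; a practical implementation computes only these entries rather than masking a dense n-by-n matrix.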