Chapter 18: Further Reading
Foundational Papers
- Bahdanau, D., Cho, K., and Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. The landmark paper that introduced additive attention for sequence-to-sequence models, demonstrating dramatic improvements in machine translation quality for long sentences. https://arxiv.org/abs/1409.0473
- Luong, T., Pham, H., and Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015. Introduced multiplicative (dot-product and general) attention scoring and the distinction between global and local attention. https://arxiv.org/abs/1508.04025
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). "Attention Is All You Need." NeurIPS 2017. The Transformer paper. Introduced scaled dot-product attention, multi-head attention, and the complete attention-only architecture that replaced RNNs. A minimal code sketch of scaled dot-product attention follows this list. https://arxiv.org/abs/1706.03762
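To make the core operation concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name, mask handling, and toy shapes are illustrative choices, not taken from the paper.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v.

    q, k: (..., seq_len, d_k); v: (..., seq_len, d_v).
    mask: optional boolean tensor broadcastable to the score shape;
    positions where mask is False are excluded from attention.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # similarity scores
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # attention distribution over keys
    return weights @ v, weights

# Toy usage: batch of 2 sequences, length 5, model dimension 16.
q = k = v = torch.randn(2, 5, 16)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
```

The division by $\sqrt{d_k}$ keeps the scores at a scale where the softmax does not saturate, which is the paper's stated motivation for the scaling factor.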
Attention Analysis and Interpretation
- Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention." ACL 2019 Workshop on BlackboxNLP. Detailed analysis of attention head specialization in BERT, including positional, syntactic, and separator-attending heads. A short sketch showing how to extract such attention maps follows this list. https://arxiv.org/abs/1906.04341
- Jain, S. and Wallace, B. C. (2019). "Attention is not Explanation." NAACL 2019. Demonstrated that attention weights often do not correlate with gradient-based feature importance and that alternative attention distributions can produce identical predictions. https://arxiv.org/abs/1902.10186
- Wiegreffe, S. and Pinter, Y. (2019). "Attention is not not Explanation." EMNLP 2019. A response to Jain and Wallace arguing that attention can provide useful explanatory information when analyzed carefully. https://arxiv.org/abs/1908.04626
- Michel, P., Levy, O., and Neubig, G. (2019). "Are Sixteen Heads Really Better than One?" NeurIPS 2019. Showed that many attention heads can be pruned with little loss in performance, suggesting significant redundancy in multi-head attention. https://arxiv.org/abs/1905.10650
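For readers who want to inspect attention heads themselves, the sketch below pulls per-head attention maps from a pretrained BERT model. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint are available; the per-row argmax shown here is a simple illustration, not the analysis methodology of the papers above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
layer, head = 0, 0
attn = outputs.attentions[layer][0, head]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, tok in enumerate(tokens):
    strongest = attn[i].argmax().item()
    print(f"{tok:>8} attends most to {tokens[strongest]}")
```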
Efficient Attention Variants
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. Introduced an IO-aware algorithm that computes exact attention without materializing the full $n \times n$ matrix, reducing memory from $O(n^2)$ to $O(n)$. https://arxiv.org/abs/2205.14135
- Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv. Introduced sparse attention patterns (strided and fixed) that reduce attention complexity from $O(n^2)$ to $O(n\sqrt{n})$. https://arxiv.org/abs/1904.10509
- Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." ICML 2020. Replaced the softmax with a kernel feature map to achieve $O(n)$ attention complexity; a minimal sketch of this kernel trick follows this list. https://arxiv.org/abs/2006.16236
- Beltagy, I., Peters, M. E., and Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv. Combined sliding window attention with global attention tokens for efficient processing of long documents (up to 4,096 tokens). https://arxiv.org/abs/2004.05150
- Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). "Rethinking Attention with Performers." ICLR 2021. Used random feature approximations to linearize softmax attention, enabling $O(n)$ computation. https://arxiv.org/abs/2009.14794
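The kernel trick behind linear attention fits in a few lines. The sketch below follows the non-causal formulation with the $\mathrm{elu}(x) + 1$ feature map suggested by Katharopoulos et al.; function names and toy shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def elu_feature_map(x):
    # phi(x) = elu(x) + 1 keeps the features positive (Katharopoulos et al., 2020).
    return F.elu(x) + 1.0

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention: approximate softmax(QK^T)V by
    phi(Q) (phi(K)^T V), normalized per query, in O(n) time and memory.
    Shapes: q, k (batch, n, d_k); v (batch, n, d_v)."""
    q, k = elu_feature_map(q), elu_feature_map(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                 # d_k x d_v summary of keys/values
    norm = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, norm)

q = k = v = torch.randn(2, 128, 32)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 32])
```

Separately, recent PyTorch releases expose torch.nn.functional.scaled_dot_product_attention, which computes exact attention and can dispatch to fused FlashAttention-style kernels when the hardware and input shapes allow.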
Attention and Memory Networks
- Graves, A., Wayne, G., and Danihelka, I. (2014). "Neural Turing Machines." arXiv. Introduced differentiable memory with both content-based and location-based addressing, a precursor to modern attention mechanisms. https://arxiv.org/abs/1410.5401
- Weston, J., Chopra, S., and Bordes, A. (2015). "Memory Networks." ICLR 2015. Proposed storing facts in an explicit memory and retrieving relevant entries with attention-like addressing for question answering. https://arxiv.org/abs/1410.3916
- Miller, A., Fisch, A., Dodge, J., Karimi, A.-H., Bordes, A., and Weston, J. (2016). "Key-Value Memory Networks for Directly Reading Documents." EMNLP 2016. Explicitly separated keys (for addressing) from values (for retrieval) in memory networks; a toy key-value memory read is sketched after this list. https://arxiv.org/abs/1606.03126
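The key-value split maps directly onto the query/key/value terminology used throughout this chapter. Below is a toy, single-hop memory read; the memory contents and function name are made up for illustration.

```python
import torch

def key_value_memory_read(query, keys, values):
    """One read from a key-value memory: keys address, values are retrieved.
    Shapes: query (d,); keys (m, d); values (m, d_v)."""
    scores = keys @ query                    # content-based addressing
    weights = torch.softmax(scores, dim=-1)  # soft attention over the m memory slots
    return weights @ values                  # weighted sum of stored values

# Toy memory with 4 slots; the query matches slot 2's key most strongly.
keys = torch.eye(4)
values = torch.tensor([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]])
query = torch.tensor([0.0, 0.0, 5.0, 0.0])
print(key_value_memory_read(query, keys, values))  # close to [2., 2.]
```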
Textbooks and Tutorials
- Jurafsky, D. and Martin, J. H. (2024). Speech and Language Processing, 3rd ed. (draft). Chapter 9 covers attention in the context of RNNs and Transformers with excellent pedagogical exposition. https://web.stanford.edu/~jurafsky/slp3/
- Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2023). Dive into Deep Learning. Chapter 11 provides interactive, executable implementations of all attention variants in PyTorch, TensorFlow, and JAX. https://d2l.ai/
- Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning Publications. Chapters 3-4 provide from-scratch implementations of multi-head attention and the Transformer with detailed commentary; a compact multi-head attention sketch in that spirit follows this list. https://www.manning.com/books/build-a-large-language-model-from-scratch
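For readers who want to see the multi-head wrapper before diving into those books, here is a compact sketch that stacks the scaled dot-product operation across heads. It is deliberately simplified (no dropout, masking, caching, or initialization tricks), and all names are illustrative rather than drawn from any particular book.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention (no dropout, masking, or caching)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused q/k/v projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        # Project, split into q/k/v, then into heads: (batch, heads, seq_len, d_head).
        q, k, v = (t.reshape(b, n, self.num_heads, self.d_head).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = torch.softmax(scores, dim=-1)
        context = (weights @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(context)                     # merge heads and project back

x = torch.randn(2, 7, 64)
print(MultiHeadSelfAttention(d_model=64, num_heads=8)(x).shape)  # torch.Size([2, 7, 64])
```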
Online Resources
- The Illustrated Transformer by Jay Alammar: A widely referenced visual guide to the Transformer and attention mechanism. https://jalammar.github.io/illustrated-transformer/
- The Annotated Transformer by Harvard NLP: A line-by-line PyTorch implementation of the original Transformer paper with detailed annotations. https://nlp.seas.harvard.edu/annotated-transformer/
- "Attention? Attention!" by Lilian Weng: A blog post providing a comprehensive survey of attention mechanisms from Bahdanau-style attention to Transformers. https://lilianweng.github.io/posts/2018-06-24-attention/
Advanced Topics for Further Study
- Relative positional encodings: Shaw et al. (2018) introduced learned relative position representations, embeddings of token offsets added to the keys and values, which improve attention's ability to generalize across sequence lengths. See "Self-Attention with Relative Position Representations."
- Rotary Positional Embeddings (RoPE): Su et al. (2021) encode position by rotating query and key vectors, producing dot products that depend only on relative position. Used in Llama and Mistral. A short RoPE sketch follows this list.
- Multi-query and grouped-query attention: Shazeer (2019) proposed multi-query attention, which shares a single set of keys and values across all heads to shrink the KV cache during inference; grouped-query attention (Ainslie et al., 2023) generalizes this by sharing keys and values within groups of heads. Both are widely used in modern LLMs.
- Sigmoid attention: Ramapuram et al. (2024) revisited replacing softmax with sigmoid in attention, showing competitive results with potential efficiency benefits.
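To make the rotation idea concrete, here is a minimal RoPE sketch that rotates adjacent dimension pairs of a vector by position-dependent angles. The pairing convention and base frequency follow the common formulation (other conventions exist in practice); everything else is illustrative.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (batch, seq_len, d), d even.
    Each adjacent dimension pair i is rotated by the angle m * theta_i, where m is
    the position and theta_i = base^(-2i/d)."""
    _, n, d = x.shape
    freqs = base ** (-torch.arange(0, d // 2, dtype=torch.float32) * 2.0 / d)
    angles = torch.arange(n, dtype=torch.float32)[:, None] * freqs[None, :]  # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # the two halves of each dimension pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Property check: for the same underlying vector, the dot product of its rotated
# copies depends only on the positional offset (here 3), not the absolute position.
v = torch.randn(16)
r = rotary_embed(v.repeat(1, 10, 1))
print(torch.allclose(r[0, 2] @ r[0, 5], r[0, 4] @ r[0, 7], atol=1e-5))  # True
```

The final check prints True because rotating both vectors and taking their dot product is equivalent to rotating one of them by the difference of the two angles, which is exactly the relative-position property that makes RoPE attractive for attention.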