Chapter 18: Further Reading
Foundational Papers
- Bahdanau, D., Cho, K., and Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. The landmark paper that introduced additive attention for sequence-to-sequence models, demonstrating dramatic improvements in machine translation quality for long sentences. https://arxiv.org/abs/1409.0473
- Luong, T., Pham, H., and Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015. Introduced multiplicative (dot-product and general) attention scoring and the distinction between global and local attention. https://arxiv.org/abs/1508.04025
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). "Attention Is All You Need." NeurIPS 2017. The Transformer paper. Introduced scaled dot-product attention, multi-head attention, and the complete attention-only architecture that replaced RNNs. A minimal code sketch of scaled dot-product attention follows this list. https://arxiv.org/abs/1706.03762
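To make the core operation concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name, mask handling, and toy shapes are illustrative choices, not taken from the paper.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v.

    q, k: (..., seq_len, d_k); v: (..., seq_len, d_v).
    mask: optional boolean tensor broadcastable to the score shape;
    positions where mask is False are excluded from attention.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # similarity scores
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # attention distribution over keys
    return weights @ v, weights

# Toy usage: batch of 2 sequences, length 5, model dimension 16.
q = k = v = torch.randn(2, 5, 16)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
```

The division by $\sqrt{d_k}$ keeps the scores at a scale where the softmax does not saturate, which is the paper's stated motivation for the scaling factor.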
Attention Analysis and Interpretation
- Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention." ACL 2019 Workshop on BlackboxNLP. Detailed analysis of attention head specialization in BERT, including positional, syntactic, and separator-attending heads. A short sketch showing how to extract such attention maps follows this list. https://arxiv.org/abs/1906.04341
- Jain, S. and Wallace, B. C. (2019). "Attention is not Explanation." NAACL 2019. Demonstrated that attention weights often do not correlate with gradient-based feature importance and that alternative attention distributions can produce identical predictions. https://arxiv.org/abs/1902.10186
- Wiegreffe, S. and Pinter, Y. (2019). "Attention is not not Explanation." EMNLP 2019. A response to Jain and Wallace arguing that attention can provide useful explanatory information when analyzed carefully. https://arxiv.org/abs/1908.04626
- Michel, P., Levy, O., and Neubig, G. (2019). "Are Sixteen Heads Really Better than One?" NeurIPS 2019. Showed that many attention heads can be pruned with little loss in performance, suggesting significant redundancy in multi-head attention. https://arxiv.org/abs/1905.10650
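For readers who want to inspect attention heads themselves, the sketch below pulls per-head attention maps from a pretrained BERT model. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint are available; the per-row argmax shown here is a simple illustration, not the analysis methodology of the papers above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
layer, head = 0, 0
attn = outputs.attentions[layer][0, head]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, tok in enumerate(tokens):
    strongest = attn[i].argmax().item()
    print(f"{tok:>8} attends most to {tokens[strongest]}")
```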
Efficient Attention Variants
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. Introduced an IO-aware algorithm that computes exact attention without materializing the full $n \times n$ matrix, reducing memory from $O(n^2)$ to $O(n)$. https://arxiv.org/abs/2205.14135
- Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv. Introduced sparse attention patterns (strided and fixed) that reduce attention complexity from $O(n^2)$ to $O(n\sqrt{n})$. https://arxiv.org/abs/1904.10509
- Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." ICML 2020. Replaced the softmax with a kernel feature map to achieve $O(n)$ attention complexity; a minimal sketch of this kernel trick follows this list. https://arxiv.org/abs/2006.16236
- Beltagy, I., Peters, M. E., and Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv. Combined sliding window attention with global attention tokens for efficient processing of long documents (up to 4,096 tokens). https://arxiv.org/abs/2004.05150
- Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2021). "Rethinking Attention with Performers." ICLR 2021. Used random feature approximations to linearize softmax attention, enabling $O(n)$ computation. https://arxiv.org/abs/2009.14794
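The kernel trick behind linear attention fits in a few lines. The sketch below follows the non-causal formulation with the $\mathrm{elu}(x) + 1$ feature map suggested by Katharopoulos et al.; function names and toy shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def elu_feature_map(x):
    # phi(x) = elu(x) + 1 keeps the features positive (Katharopoulos et al., 2020).
    return F.elu(x) + 1.0

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention: approximate softmax(QK^T)V by
    phi(Q) (phi(K)^T V), normalized per query, in O(n) time and memory.
    Shapes: q, k (batch, n, d_k); v (batch, n, d_v)."""
    q, k = elu_feature_map(q), elu_feature_map(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                 # d_k x d_v summary of keys/values
    norm = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, norm)

q = k = v = torch.randn(2, 128, 32)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 32])
```

Separately, recent PyTorch releases expose torch.nn.functional.scaled_dot_product_attention, which computes exact attention and can dispatch to fused FlashAttention-style kernels when the hardware and input shapes allow.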
Attention and Memory Networks
- Graves, A., Wayne, G., and Danihelka, I. (2014). "Neural Turing Machines." arXiv. Introduced differentiable memory with both content-based and location-based addressing, a precursor to modern attention mechanisms. https://arxiv.org/abs/1410.5401
- Weston, J., Chopra, S., and Bordes, A. (2015). "Memory Networks." ICLR 2015. Proposed storing facts in an explicit memory and retrieving relevant entries with attention-like addressing for question answering. https://arxiv.org/abs/1410.3916
- Miller, A., Fisch, A., Dodge, J., Karimi, A.-H., Bordes, A., and Weston, J. (2016). "Key-Value Memory Networks for Directly Reading Documents." EMNLP 2016. Explicitly separated keys (for addressing) from values (for retrieval) in memory networks; a toy key-value memory read is sketched after this list. https://arxiv.org/abs/1606.03126
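The key-value split maps directly onto the query/key/value terminology used throughout this chapter. Below is a toy, single-hop memory read; the memory contents and function name are made up for illustration.

```python
import torch

def key_value_memory_read(query, keys, values):
    """One read from a key-value memory: keys address, values are retrieved.
    Shapes: query (d,); keys (m, d); values (m, d_v)."""
    scores = keys @ query                    # content-based addressing
    weights = torch.softmax(scores, dim=-1)  # soft attention over the m memory slots
    return weights @ values                  # weighted sum of stored values

# Toy memory with 4 slots; the query matches slot 2's key most strongly.
keys = torch.eye(4)
values = torch.tensor([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]])
query = torch.tensor([0.0, 0.0, 5.0, 0.0])
print(key_value_memory_read(query, keys, values))  # close to [2., 2.]
```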
Textbooks and Tutorials
- Jurafsky, D. and Martin, J. H. (2024). Speech and Language Processing, 3rd ed. (draft). Chapter 9 covers attention in the context of RNNs and Transformers with excellent pedagogical exposition. https://web.stanford.edu/~jurafsky/slp3/
- Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2023). Dive into Deep Learning. Chapter 11 provides interactive, executable implementations of all attention variants in PyTorch, TensorFlow, and JAX. https://d2l.ai/
- Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning Publications. Chapters 3-4 provide from-scratch implementations of multi-head attention and the Transformer with detailed commentary; a compact multi-head attention sketch in that spirit follows this list. https://www.manning.com/books/build-a-large-language-model-from-scratch
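For readers who want to see the multi-head wrapper before diving into those books, here is a compact sketch that stacks the scaled dot-product operation across heads. It is deliberately simplified (no dropout, masking, caching, or initialization tricks), and all names are illustrative rather than drawn from any particular book.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention (no dropout, masking, or caching)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused q/k/v projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        # Project, split into q/k/v, then into heads: (batch, heads, seq_len, d_head).
        q, k, v = (t.reshape(b, n, self.num_heads, self.d_head).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = torch.softmax(scores, dim=-1)
        context = (weights @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(context)                     # merge heads and project back

x = torch.randn(2, 7, 64)
print(MultiHeadSelfAttention(d_model=64, num_heads=8)(x).shape)  # torch.Size([2, 7, 64])
```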
Online Resources
- The Illustrated Transformer by Jay Alammar: A widely referenced visual guide to the Transformer and attention mechanism. https://jalammar.github.io/illustrated-transformer/
- The Annotated Transformer by Harvard NLP: A line-by-line PyTorch implementation of the original Transformer paper with detailed annotations. https://nlp.seas.harvard.edu/annotated-transformer/
- "Attention? Attention!" by Lilian Weng: A blog post providing a comprehensive survey of attention mechanisms from Bahdanau-style attention to Transformers. https://lilianweng.github.io/posts/2018-06-24-attention/
Advanced Topics for Further Study
- Relative positional encodings: Shaw et al. (2018) introduced learned relative position representations, embeddings of token offsets added to the keys and values, which improve attention's ability to generalize across sequence lengths. See "Self-Attention with Relative Position Representations."
- Rotary Positional Embeddings (RoPE): Su et al. (2021) encode position by rotating query and key vectors, producing dot products that depend only on relative position. Used in Llama and Mistral. A short RoPE sketch follows this list.
- Multi-query and grouped-query attention: Shazeer (2019) proposed multi-query attention, which shares a single set of keys and values across all heads to shrink the KV cache during inference; grouped-query attention (Ainslie et al., 2023) generalizes this by sharing keys and values within groups of heads. Both are widely used in modern LLMs.
- Sigmoid attention: Ramapuram et al. (2024) revisited replacing softmax with sigmoid in attention, showing competitive results with potential efficiency benefits.
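To make the rotation idea concrete, here is a minimal RoPE sketch that rotates adjacent dimension pairs of a vector by position-dependent angles. The pairing convention and base frequency follow the common formulation (other conventions exist in practice); everything else is illustrative.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (batch, seq_len, d), d even.
    Each adjacent dimension pair i is rotated by the angle m * theta_i, where m is
    the position and theta_i = base^(-2i/d)."""
    _, n, d = x.shape
    freqs = base ** (-torch.arange(0, d // 2, dtype=torch.float32) * 2.0 / d)
    angles = torch.arange(n, dtype=torch.float32)[:, None] * freqs[None, :]  # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # the two halves of each dimension pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Property check: for the same underlying vector, the dot product of its rotated
# copies depends only on the positional offset (here 3), not the absolute position.
v = torch.randn(16)
r = rotary_embed(v.repeat(1, 10, 1))
print(torch.allclose(r[0, 2] @ r[0, 5], r[0, 4] @ r[0, 7], atol=1e-5))  # True
```

The final check prints True because rotating both vectors and taking their dot product is equivalent to rotating one of them by the difference of the two angles, which is exactly the relative-position property that makes RoPE attractive for attention.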