Chapter 20: Further Reading
Foundational Papers
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. Introduced BERT with masked language modeling and next sentence prediction, establishing the pre-train/fine-tune paradigm (a brief fill-mask sketch follows this list). https://arxiv.org/abs/1810.04805
- Peters, M. E., Neumann, M., Iyyer, M., et al. (2018). "Deep Contextualized Word Representations." NAACL 2018. Introduced ELMo, the first widely successful approach to contextualized word embeddings using bidirectional LSTMs. https://arxiv.org/abs/1802.05365
- Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification." ACL 2018. Introduced ULMFiT, demonstrating effective transfer learning for NLP with discriminative fine-tuning and gradual unfreezing. https://arxiv.org/abs/1801.06146
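The masked-language-modeling objective introduced by BERT can be probed directly. Below is a minimal sketch, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (both described under Software and Online Resources), that asks BERT to fill in a masked token:

```python
# Minimal sketch: query BERT's masked-language-modeling head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT scores vocabulary items for the [MASK] position.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```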
BERT Variants and Improvements
- Liu, Y., Ott, M., Goyal, N., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv. Showed that BERT was significantly undertrained and achieved better results by removing NSP, using dynamic masking, and training on more data. https://arxiv.org/abs/1907.11692
- Lan, Z., Chen, M., Goodman, S., et al. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." ICLR 2020. Introduced factorized embeddings and cross-layer parameter sharing for parameter-efficient pre-training. https://arxiv.org/abs/1909.11942
- Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). "DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter." NeurIPS 2019 Workshop. Applied knowledge distillation to make BERT 40% smaller and 60% faster while retaining 97% of its language understanding performance (a sketch of the distillation loss follows this list). https://arxiv.org/abs/1910.01108
- Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." ICLR 2020. Replaced MLM with replaced token detection for more sample-efficient pre-training. https://arxiv.org/abs/2003.10555
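DistilBERT's training objective pairs the usual MLM loss with a soft-target term that matches the student's output distribution to the teacher's. A minimal PyTorch sketch of that soft-target term is below; the function name and temperature value are illustrative assumptions, not the paper's exact training code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-scaled teacher and student
    # distributions (Hinton-style soft targets).
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Illustrative call with random logits over a 30k-token vocabulary.
print(distillation_loss(torch.randn(8, 30000), torch.randn(8, 30000)))
```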
Text-to-Text and Encoder-Decoder Models
- Raffel, C., Shazeer, N., Roberts, A., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR 2020. Introduced T5 and systematically compared pre-training objectives, architectures, and data strategies (a text-to-text usage sketch follows this list). https://arxiv.org/abs/1910.10683
- Lewis, M., Liu, Y., Goyal, N., et al. (2020). "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." ACL 2020. Combined a bidirectional encoder with an autoregressive decoder using denoising pre-training. https://arxiv.org/abs/1910.13461
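In the text-to-text framing, every task (translation, summarization, classification, and so on) is posed as generating an output string from an input string that names the task. A minimal sketch, assuming the t5-small checkpoint and the transformers pipeline API:

```python
# Sketch of T5's text-to-text interface: the task is selected by a
# plain-text prefix in the input string.
from transformers import pipeline

t2t = pipeline("text2text-generation", model="t5-small")

print(t2t("translate English to German: The book is on the table."))
print(t2t("summarize: Transfer learning has become the dominant paradigm in "
          "NLP, with large pre-trained models fine-tuned on small "
          "task-specific datasets."))
```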
Tokenization
- Sennrich, R., Haddow, B., and Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. Introduced byte-pair encoding (BPE) for neural MT, now standard in modern LLM tokenizers (a training sketch follows this list). https://arxiv.org/abs/1508.07909
- Kudo, T. and Richardson, J. (2018). "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing." EMNLP 2018. Language-agnostic tokenizer supporting BPE and unigram models. https://arxiv.org/abs/1808.06226
- Schuster, M. and Nakajima, K. (2012). "Japanese and Korean Voice Search." ICASSP 2012. The original WordPiece paper, whose tokenizer is used by BERT and related models.
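The BPE procedure from Sennrich et al. can be reproduced in a few lines with the HuggingFace tokenizers library (listed under Software below). A minimal sketch, with a toy corpus and vocabulary size chosen purely for illustration:

```python
# Train a tiny BPE vocabulary: frequent character sequences are merged
# into subword units, so rare surface forms decompose into known pieces.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = ["low lower lowest", "new newer newest", "wide wider widest"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=60, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("newest widest").tokens)
```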
Analysis and Interpretability
- Jawahar, G., Sagot, B., and Seddah, D. (2019). "What Does BERT Learn about the Structure of Language?" ACL 2019. Probing analysis showing BERT encodes syntax in lower layers and semantics in upper layers. https://aclanthology.org/P19-1356/
- Tenney, I., Das, D., and Pavlick, E. (2019). "BERT Rediscovers the Classical NLP Pipeline." ACL 2019. Layer-by-layer probing showing BERT recapitulates POS tagging, parsing, NER, and semantic role labeling in sequence. https://arxiv.org/abs/1905.05950
- Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention." BlackboxNLP 2019. Analyzed attention patterns, revealing syntactic and positional head specialization (a sketch for extracting attention maps follows this list). https://arxiv.org/abs/1906.04341
- Rogers, A., Kovaleva, O., and Rumshisky, A. (2020). "A Primer in BERTology: What We Know About How BERT Works." TACL 2020. Comprehensive survey of research on BERT's internal representations and behavior. https://arxiv.org/abs/2002.12327
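The attention analyses above start from the per-layer attention maps, which the transformers library will return directly. A minimal sketch, assuming the bert-base-uncased checkpoint:

```python
# Extract BERT's attention maps for inspection.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One tensor per layer, each of shape (batch, heads, seq_len, seq_len).
print(len(outputs.attentions), outputs.attentions[0].shape)
```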
Static Embeddings (Historical Context)
- Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." ICLR 2013 Workshop. Introduced Word2Vec (CBOW and Skip-gram). https://arxiv.org/abs/1301.3781
- Pennington, J., Socher, R., and Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." EMNLP 2014. Word embeddings learned from global co-occurrence statistics. https://nlp.stanford.edu/pubs/glove.pdf
Benchmarks
- Wang, A., Singh, A., Michael, J., et al. (2019). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." ICLR 2019. Standard benchmark for evaluating pre-trained language models (a loading sketch follows this list). https://arxiv.org/abs/1804.07461
- Wang, A., Pruksachatkun, Y., Nangia, N., et al. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." NeurIPS 2019. A harder successor to GLUE with more challenging tasks. https://arxiv.org/abs/1905.00537
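Both benchmarks are distributed through the HuggingFace datasets library (listed under Software below), so individual tasks can be loaded by name. A minimal sketch using the SST-2 task from GLUE:

```python
# Load one GLUE task; the object holds train/validation/test splits.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(sst2)
print(sst2["train"][0])  # e.g. {'sentence': ..., 'label': ..., 'idx': ...}
```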
Textbooks and Tutorials
- Jurafsky, D. and Martin, J. H. (2024). Speech and Language Processing, 3rd ed. (draft). Chapters 11 and 12 cover BERT, transfer learning, and fine-tuning with excellent exposition. https://web.stanford.edu/~jurafsky/slp3/
- Tunstall, L., von Werra, L., and Wolf, T. (2022). Natural Language Processing with Transformers. O'Reilly. A thorough, practical guide to the HuggingFace ecosystem, with worked examples for every major NLP task.
Online Resources
- HuggingFace Course: Free online course covering the transformers library, tokenizers, and fine-tuning workflows. https://huggingface.co/course
- The Illustrated BERT by Jay Alammar: Visual guide to BERT's architecture and pre-training objectives. https://jalammar.github.io/illustrated-bert/
- HuggingFace Model Hub: Repository of thousands of pre-trained models with model cards and usage examples. https://huggingface.co/models
Software
- HuggingFace Transformers: Unified API for pre-trained models. https://github.com/huggingface/transformers
- HuggingFace Tokenizers: Fast tokenizer implementations in Rust with Python bindings. https://github.com/huggingface/tokenizers
- HuggingFace Datasets: Efficient dataset loading and processing. https://github.com/huggingface/datasets
- Sentence-Transformers: Library for computing sentence embeddings with Transformer models (a usage sketch follows this list). https://www.sbert.net/
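As an example of the last entry, a minimal Sentence-Transformers sketch; the all-MiniLM-L6-v2 checkpoint is one commonly used default, and other models from the library's hub work the same way:

```python
# Encode sentences and compare them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["A man is playing a guitar.",
             "Someone is strumming an instrument.",
             "The stock market fell sharply today."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Semantically close sentences get higher pairwise similarity scores.
print(util.cos_sim(embeddings, embeddings))
```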