Chapter 20: Further Reading

Foundational Papers

  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. Introduced BERT with masked language modeling and next sentence prediction, establishing the pre-train/fine-tune paradigm (a short masked-language-modeling sketch follows this list). https://arxiv.org/abs/1810.04805

  • Peters, M. E., Neumann, M., Iyyer, M., et al. (2018). "Deep Contextualized Word Representations." NAACL 2018. Introduced ELMo, the first widely successful approach to contextualized word embeddings using bidirectional LSTMs. https://arxiv.org/abs/1802.05365

  • Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification." ACL 2018. Introduced ULMFiT, demonstrating effective transfer learning for NLP with discriminative fine-tuning and gradual unfreezing. https://arxiv.org/abs/1801.06146
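
The masked-language-modeling objective introduced by BERT is easy to probe interactively. Below is a minimal sketch, assuming the HuggingFace transformers library (listed under Software) is installed; "bert-base-uncased" is one commonly used example checkpoint.

      # Minimal sketch: querying BERT's masked-language-modeling head.
      from transformers import pipeline

      fill_mask = pipeline("fill-mask", model="bert-base-uncased")

      # The model predicts a distribution over vocabulary tokens for the [MASK] slot.
      for prediction in fill_mask("The capital of France is [MASK]."):
          print(prediction["token_str"], round(prediction["score"], 3))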

BERT Variants and Improvements

  • Liu, Y., Ott, M., Goyal, N., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv. Showed that BERT was significantly undertrained and achieved better results by removing NSP, using dynamic masking, and training on more data. https://arxiv.org/abs/1907.11692

  • Lan, Z., Chen, M., Goodman, S., et al. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." ICLR 2020. Introduced factorized embeddings and cross-layer parameter sharing for parameter-efficient pre-training. https://arxiv.org/abs/1909.11942

  • Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). "DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter." NeurIPS 2019 Workshop. Applied knowledge distillation to compress BERT to 60% of its size while retaining 97% of its performance. https://arxiv.org/abs/1910.01108

  • Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." ICLR 2020. Replaced MLM with replaced token detection for more sample-efficient pre-training. https://arxiv.org/abs/2003.10555

Text-to-Text and Encoder-Decoder Models

  • Raffel, C., Shazeer, N., Roberts, A., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR 2020. Introduced T5 and systematically compared pre-training objectives, architectures, and data strategies. https://arxiv.org/abs/1910.10683

  • Lewis, M., Liu, Y., Goyal, N., et al. (2020). "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." ACL 2020. Combined bidirectional encoder with autoregressive decoder using denoising pre-training. https://arxiv.org/abs/1910.13461

Tokenization

  • Sennrich, R., Haddow, B., and Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. Introduced byte-pair encoding (BPE) for neural MT; BPE and its variants now underlie the tokenizers of most modern LLMs (a simplified merge sketch follows this list). https://arxiv.org/abs/1508.07909

  • Kudo, T. and Richardson, J. (2018). "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing." EMNLP 2018. Language-agnostic tokenizer supporting BPE and unigram models. https://arxiv.org/abs/1808.06226

  • Schuster, M. and Nakajima, K. (2012). "Japanese and Korean Voice Search." ICASSP 2012. Original WordPiece paper used by BERT and related models.
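
The core BPE procedure from Sennrich et al. is compact enough to sketch directly. This is a simplified illustration (word frequencies with space-separated symbols) using only the Python standard library, not a production tokenizer.

      # Simplified byte-pair encoding merges, after Sennrich et al. (2016).
      # The vocabulary maps space-separated symbol sequences to word frequencies.
      import re
      from collections import Counter

      def get_pair_counts(vocab):
          pairs = Counter()
          for word, freq in vocab.items():
              symbols = word.split()
              for a, b in zip(symbols, symbols[1:]):
                  pairs[(a, b)] += freq
          return pairs

      def merge_pair(pair, vocab):
          # Merge the pair wherever it appears as two whole, adjacent symbols.
          bigram = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
          return {bigram.sub("".join(pair), word): freq for word, freq in vocab.items()}

      vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
      for _ in range(8):                      # the number of merges is a hyperparameter
          pair_counts = get_pair_counts(vocab)
          if not pair_counts:
              break
          best_pair = pair_counts.most_common(1)[0][0]
          vocab = merge_pair(best_pair, vocab)
      print(vocab)                            # frequent substrings such as "est</w>" emerge as units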

Analysis and Interpretability

  • Jawahar, G., Sagot, B., and Seddah, D. (2019). "What Does BERT Learn about the Structure of Language?" ACL 2019. Probing analysis showing BERT encodes surface features in lower layers, syntactic features in middle layers, and semantic features in upper layers. https://aclanthology.org/P19-1356/

  • Tenney, I., Das, D., and Pavlick, E. (2019). "BERT Rediscovers the Classical NLP Pipeline." ACL 2019. Layer-by-layer probing showing BERT recapitulates POS, parsing, NER, and semantic roles in sequence. https://arxiv.org/abs/1905.05950

  • Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention." BlackboxNLP 2019. Analyzed attention patterns, revealing syntactic and positional head specialization (a short attention-extraction sketch follows this list). https://arxiv.org/abs/1906.04341

  • Rogers, A., Kovaleva, O., and Rumshisky, A. (2020). "A Primer in BERTology: What We Know About How BERT Works." TACL 2020. Comprehensive survey of research on BERT's internal representations and behavior. https://arxiv.org/abs/2002.12327
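
Attention patterns of the kind analyzed by Clark et al. can be pulled out of any pre-trained checkpoint in a few lines. A minimal sketch, assuming the transformers library and the example checkpoint "bert-base-uncased"; the attentions output contains one tensor per layer.

      # Minimal sketch: extracting per-layer attention weights from BERT.
      import torch
      from transformers import AutoModel, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
      model = AutoModel.from_pretrained("bert-base-uncased")

      inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
      with torch.no_grad():
          outputs = model(**inputs, output_attentions=True)

      # One attention tensor per layer: (batch, num_heads, seq_len, seq_len).
      print(len(outputs.attentions), outputs.attentions[0].shape)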

Static Embeddings (Historical Context)

  • Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." ICLR 2013 Workshop. Introduced Word2Vec (CBOW and Skip-gram). https://arxiv.org/abs/1301.3781

  • Pennington, J., Socher, R., and Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." EMNLP 2014. Word embeddings from global co-occurrence statistics. https://nlp.stanford.edu/pubs/glove.pdf

Benchmarks

  • Wang, A., Singh, A., Michael, J., et al. (2019). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." ICLR 2019. Standard benchmark for evaluating pre-trained language models (a short loading sketch follows this list). https://arxiv.org/abs/1804.07461

  • Wang, A., Pruksachatkun, Y., Nangia, N., et al. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." NeurIPS 2019. Harder successor to GLUE with more challenging tasks. https://arxiv.org/abs/1905.00537
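
Both benchmarks are distributed through the HuggingFace datasets library (listed under Software). A minimal loading sketch, assuming datasets is installed; "mrpc" is just one of the GLUE tasks.

      # Minimal sketch: loading a GLUE task with the HuggingFace datasets library.
      from datasets import load_dataset

      mrpc = load_dataset("glue", "mrpc")     # other configs include "sst2", "cola", "qnli"
      print(mrpc)                             # DatasetDict with train/validation/test splits
      print(mrpc["train"][0])                 # one sentence-pair example with its label

      # SuperGLUE is published under the "super_glue" dataset name, e.g.
      # load_dataset("super_glue", "boolq").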

Textbooks and Tutorials

  • Jurafsky, D. and Martin, J. H. (2024). Speech and Language Processing, 3rd ed. (Draft). Chapters 11–12 cover BERT, transfer learning, and fine-tuning with excellent exposition. https://web.stanford.edu/~jurafsky/slp3/

  • Tunstall, L., von Werra, L., and Wolf, T. (2022). Natural Language Processing with Transformers. O'Reilly. The definitive guide to the HuggingFace ecosystem with practical examples for every major NLP task.

Online Resources

  • HuggingFace Course: Free online course covering the transformers library, tokenizers, and fine-tuning workflows. https://huggingface.co/course

  • The Illustrated BERT by Jay Alammar: Visual guide to BERT's architecture and pre-training objectives. https://jalammar.github.io/illustrated-bert/

  • HuggingFace Model Hub: Repository of thousands of pre-trained models with model cards and usage examples. https://huggingface.co/models

Software

  • HuggingFace Transformers: Unified API for pre-trained models (a short usage sketch follows at the end of this list). https://github.com/huggingface/transformers

  • HuggingFace Tokenizers: Fast tokenizer implementations in Rust. https://github.com/huggingface/tokenizers

  • HuggingFace Datasets: Efficient dataset loading and processing. https://github.com/huggingface/datasets

  • Sentence-Transformers: Library for computing sentence embeddings using Transformer models. https://www.sbert.net/
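
A minimal sketch tying several of these libraries together, assuming transformers and sentence-transformers are installed; the checkpoint names below are common examples rather than the only choices.

      # Load a pre-trained encoder with a fresh classification head (ready for fine-tuning),
      # then compute sentence embeddings with Sentence-Transformers.
      import torch
      from transformers import AutoModelForSequenceClassification, AutoTokenizer
      from sentence_transformers import SentenceTransformer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
      model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

      batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
      with torch.no_grad():
          logits = model(**batch).logits      # arbitrary until the new head is fine-tuned
      print(logits.shape)                     # torch.Size([2, 2])

      encoder = SentenceTransformer("all-MiniLM-L6-v2")
      embeddings = encoder.encode(["great movie", "terrible movie"])
      print(embeddings.shape)                 # (2, 384) for this checkpoint

Fine-tuning itself is typically driven by the Trainer API in transformers or a plain PyTorch training loop; both are covered in the HuggingFace course listed above.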