Chapter 20: Further Reading

Foundational Papers

  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. Introduced BERT with masked language modeling and next sentence prediction, establishing the pre-train/fine-tune paradigm (a short masked-language-modeling sketch follows this list). https://arxiv.org/abs/1810.04805

  • Peters, M. E., Neumann, M., Iyyer, M., et al. (2018). "Deep Contextualized Word Representations." NAACL 2018. Introduced ELMo, the first widely successful approach to contextualized word embeddings using bidirectional LSTMs. https://arxiv.org/abs/1802.05365

  • Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification." ACL 2018. Introduced ULMFiT, demonstrating effective transfer learning for NLP with discriminative fine-tuning and gradual unfreezing. https://arxiv.org/abs/1801.06146
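
The masked-language-modeling objective introduced by BERT is easy to probe interactively. Below is a minimal sketch, assuming the HuggingFace transformers library (listed under Software) is installed; "bert-base-uncased" is one commonly used example checkpoint.

      # Minimal sketch: querying BERT's masked-language-modeling head.
      from transformers import pipeline

      fill_mask = pipeline("fill-mask", model="bert-base-uncased")

      # The model predicts a distribution over vocabulary tokens for the [MASK] slot.
      for prediction in fill_mask("The capital of France is [MASK]."):
          print(prediction["token_str"], round(prediction["score"], 3))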

BERT Variants and Improvements

  • Liu, Y., Ott, M., Goyal, N., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv. Showed that BERT was significantly undertrained and achieved better results by removing NSP, using dynamic masking, and training on more data. https://arxiv.org/abs/1907.11692

  • Lan, Z., Chen, M., Goodman, S., et al. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." ICLR 2020. Introduced factorized embeddings and cross-layer parameter sharing for parameter-efficient pre-training. https://arxiv.org/abs/1909.11942

  • Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). "DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter." NeurIPS 2019 Workshop. Applied knowledge distillation to compress BERT to 60% of its size while retaining 97% of its performance. https://arxiv.org/abs/1910.01108

  • Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." ICLR 2020. Replaced MLM with replaced token detection for more sample-efficient pre-training. https://arxiv.org/abs/2003.10555

Text-to-Text and Encoder-Decoder Models

  • Raffel, C., Shazeer, N., Roberts, A., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR 2020. Introduced T5 and systematically compared pre-training objectives, architectures, and data strategies. https://arxiv.org/abs/1910.10683

  • Lewis, M., Liu, Y., Goyal, N., et al. (2020). "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." ACL 2020. Combined bidirectional encoder with autoregressive decoder using denoising pre-training. https://arxiv.org/abs/1910.13461

Tokenization

  • Sennrich, R., Haddow, B., and Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. Introduced byte-pair encoding (BPE) for neural MT; BPE and its variants now underlie the tokenizers of most modern LLMs (a simplified merge sketch follows this list). https://arxiv.org/abs/1508.07909

  • Kudo, T. and Richardson, J. (2018). "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing." EMNLP 2018. Language-agnostic tokenizer supporting BPE and unigram models. https://arxiv.org/abs/1808.06226

  • Schuster, M. and Nakajima, K. (2012). "Japanese and Korean Voice Search." ICASSP 2012. Original WordPiece paper used by BERT and related models.
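
The core BPE procedure from Sennrich et al. is compact enough to sketch directly. This is a simplified illustration (word frequencies with space-separated symbols) using only the Python standard library, not a production tokenizer.

      # Simplified byte-pair encoding merges, after Sennrich et al. (2016).
      # The vocabulary maps space-separated symbol sequences to word frequencies.
      import re
      from collections import Counter

      def get_pair_counts(vocab):
          pairs = Counter()
          for word, freq in vocab.items():
              symbols = word.split()
              for a, b in zip(symbols, symbols[1:]):
                  pairs[(a, b)] += freq
          return pairs

      def merge_pair(pair, vocab):
          # Merge the pair wherever it appears as two whole, adjacent symbols.
          bigram = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
          return {bigram.sub("".join(pair), word): freq for word, freq in vocab.items()}

      vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
      for _ in range(8):                      # the number of merges is a hyperparameter
          pair_counts = get_pair_counts(vocab)
          if not pair_counts:
              break
          best_pair = pair_counts.most_common(1)[0][0]
          vocab = merge_pair(best_pair, vocab)
      print(vocab)                            # frequent substrings such as "est</w>" emerge as units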

Analysis and Interpretability

  • Jawahar, G., Sagot, B., and Seddah, D. (2019). "What Does BERT Learn about the Structure of Language?" ACL 2019. Probing analysis showing BERT encodes surface features in lower layers, syntactic features in middle layers, and semantic features in upper layers. https://aclanthology.org/P19-1356/

  • Tenney, I., Das, D., and Pavlick, E. (2019). "BERT Rediscovers the Classical NLP Pipeline." ACL 2019. Layer-by-layer probing showing BERT recapitulates POS, parsing, NER, and semantic roles in sequence. https://arxiv.org/abs/1905.05950

  • Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention." BlackboxNLP 2019. Analyzed attention patterns, revealing syntactic and positional head specialization (a short attention-extraction sketch follows this list). https://arxiv.org/abs/1906.04341

  • Rogers, A., Kovaleva, O., and Rumshisky, A. (2020). "A Primer in BERTology: What We Know About How BERT Works." TACL 2020. Comprehensive survey of research on BERT's internal representations and behavior. https://arxiv.org/abs/2002.12327
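
Attention patterns of the kind analyzed by Clark et al. can be pulled out of any pre-trained checkpoint in a few lines. A minimal sketch, assuming the transformers library and the example checkpoint "bert-base-uncased"; the attentions output contains one tensor per layer.

      # Minimal sketch: extracting per-layer attention weights from BERT.
      import torch
      from transformers import AutoModel, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
      model = AutoModel.from_pretrained("bert-base-uncased")

      inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
      with torch.no_grad():
          outputs = model(**inputs, output_attentions=True)

      # One attention tensor per layer: (batch, num_heads, seq_len, seq_len).
      print(len(outputs.attentions), outputs.attentions[0].shape)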

Static Embeddings (Historical Context)

  • Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." ICLR 2013 Workshop. Introduced Word2Vec (CBOW and Skip-gram). https://arxiv.org/abs/1301.3781

  • Pennington, J., Socher, R., and Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." EMNLP 2014. Word embeddings from global co-occurrence statistics. https://nlp.stanford.edu/pubs/glove.pdf

Benchmarks

  • Wang, A., Singh, A., Michael, J., et al. (2019). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." ICLR 2019. Standard benchmark for evaluating pre-trained language models (a short loading sketch follows this list). https://arxiv.org/abs/1804.07461

  • Wang, A., Pruksachatkun, Y., Nangia, N., et al. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." NeurIPS 2019. Harder successor to GLUE with more challenging tasks. https://arxiv.org/abs/1905.00537
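
Both benchmarks are distributed through the HuggingFace datasets library (listed under Software). A minimal loading sketch, assuming datasets is installed; "mrpc" is just one of the GLUE tasks.

      # Minimal sketch: loading a GLUE task with the HuggingFace datasets library.
      from datasets import load_dataset

      mrpc = load_dataset("glue", "mrpc")     # other configs include "sst2", "cola", "qnli"
      print(mrpc)                             # DatasetDict with train/validation/test splits
      print(mrpc["train"][0])                 # one sentence-pair example with its label

      # SuperGLUE is published under the "super_glue" dataset name, e.g.
      # load_dataset("super_glue", "boolq").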

Textbooks and Tutorials

  • Jurafsky, D. and Martin, J. H. (2024). Speech and Language Processing, 3rd ed. (Draft). Chapters 11–12 cover BERT, transfer learning, and fine-tuning with excellent exposition. https://web.stanford.edu/~jurafsky/slp3/

  • Tunstall, L., von Werra, L., and Wolf, T. (2022). Natural Language Processing with Transformers. O'Reilly. The definitive guide to the HuggingFace ecosystem with practical examples for every major NLP task.

Online Resources

  • HuggingFace Course: Free online course covering the transformers library, tokenizers, and fine-tuning workflows. https://huggingface.co/course

  • The Illustrated BERT by Jay Alammar: Visual guide to BERT's architecture and pre-training objectives. https://jalammar.github.io/illustrated-bert/

  • HuggingFace Model Hub: Repository of thousands of pre-trained models with model cards and usage examples. https://huggingface.co/models

Software

  • HuggingFace Transformers: Unified API for pre-trained models (a short usage sketch follows at the end of this list). https://github.com/huggingface/transformers

  • HuggingFace Tokenizers: Fast tokenizer implementations in Rust. https://github.com/huggingface/tokenizers

  • HuggingFace Datasets: Efficient dataset loading and processing. https://github.com/huggingface/datasets

  • Sentence-Transformers: Library for computing sentence embeddings using Transformer models. https://www.sbert.net/
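
A minimal sketch tying several of these libraries together, assuming transformers and sentence-transformers are installed; the checkpoint names below are common examples rather than the only choices.

      # Load a pre-trained encoder with a fresh classification head (ready for fine-tuning),
      # then compute sentence embeddings with Sentence-Transformers.
      import torch
      from transformers import AutoModelForSequenceClassification, AutoTokenizer
      from sentence_transformers import SentenceTransformer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
      model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

      batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
      with torch.no_grad():
          logits = model(**batch).logits      # arbitrary until the new head is fine-tuned
      print(logits.shape)                     # torch.Size([2, 2])

      encoder = SentenceTransformer("all-MiniLM-L6-v2")
      embeddings = encoder.encode(["great movie", "terrible movie"])
      print(embeddings.shape)                 # (2, 384) for this checkpoint

Fine-tuning itself is typically driven by the Trainer API in transformers or a plain PyTorch training loop; both are covered in the HuggingFace course listed above.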