Chapter 20 Key Takeaways

The Big Picture

Pre-training and transfer learning transformed NLP from a field of task-specific architectures into one where a single model, trained on massive unlabeled text, can be adapted to virtually any downstream language task with minimal labeled data. BERT demonstrated that bidirectional Transformer encoders, pre-trained with masked language modeling, produce rich contextual representations that transfer across tasks. The HuggingFace ecosystem made this accessible to every practitioner.


From Static to Contextualized Embeddings

  • Static embeddings (Word2Vec, GloVe) assign one vector per word type, losing context-dependent meaning (the polysemy problem).
  • ELMo introduced contextualized embeddings via bidirectional LSTMs, producing different representations for the same word in different contexts.
  • BERT advanced this with deep bidirectional Transformer encoders, enabling each token to attend to all other tokens simultaneously.
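
To make the contrast concrete, here is a minimal sketch (assuming the transformers and torch packages and the standard bert-base-uncased checkpoint) that extracts the vector for the word "bank" in two sentences and shows the two contextual vectors differ, which a single static embedding cannot capture.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["She sat by the river bank.", "He deposited cash at the bank."]
bank_id = tokenizer.convert_tokens_to_ids("bank")

vectors = []
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
        idx = inputs["input_ids"][0].tolist().index(bank_id) # position of "bank"
        vectors.append(hidden[idx])

cos = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity between the two 'bank' vectors: {cos.item():.2f}")  # noticeably below 1.0
```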

BERT Architecture and Pre-training

Architecture

  • Encoder-only Transformer (12 or 24 layers).
  • Input = token embedding + segment embedding + learned positional embedding.
  • Special tokens: [CLS] for classification, [SEP] for sentence boundaries.
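
A quick way to see the special tokens and segment IDs is to run a sentence pair through the model's tokenizer; this sketch assumes the transformers package and the bert-base-uncased checkpoint.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair inserts [CLS] and [SEP] automatically and emits
# token_type_ids (the segment embedding index: 0 for sentence A, 1 for sentence B).
enc = tokenizer("The cat sat.", "It was tired.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', '.', '[SEP]', 'it', 'was', 'tired', '.', '[SEP]']
print(enc["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```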

Pre-training Objectives

  • Masked Language Modeling (MLM): Randomly mask 15% of tokens (80% [MASK], 10% random, 10% unchanged) and predict them (see the sketch after this list). Enables bidirectional context without information leakage.
  • Next Sentence Prediction (NSP): Binary classification of whether sentence B follows sentence A. Later shown to be unnecessary by RoBERTa.
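
The 80/10/10 rule is easy to misread, so here is a schematic sketch of the corruption step. It is illustrative only, not the original BERT pre-training code; the mask_tokens helper, mask_id argument, and the -100 ignore-label convention follow common PyTorch practice.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (corrupted_ids, labels); labels are -100 where no prediction is required."""
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:        # select ~15% of positions
            labels[i] = tok                    # the model must predict the original token
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                corrupted[i] = mask_id
            elif r < 0.9:                      # 10%: replace with a random token
                corrupted[i] = random.randrange(vocab_size)
            # remaining 10%: keep the token unchanged
    return corrupted, labels
```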

Key Numbers

| Parameter       | BERT-Base | BERT-Large |
|-----------------|-----------|------------|
| Layers          | 12        | 24         |
| Hidden size     | 768       | 1024       |
| Attention heads | 12        | 16         |
| Parameters      | 110M      | 340M       |
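
The 110M figure can be roughly reconstructed from the architecture. The sketch below is a back-of-the-envelope count using the published bert-base-uncased vocabulary size of 30,522, with biases, LayerNorm, and the pooler omitted.

```python
V, P, S = 30_522, 512, 2      # vocab size, max positions, segment types
H, L, FF = 768, 12, 3_072     # hidden size, layers, feed-forward size

embeddings = (V + P + S) * H                  # ~23.8M token/position/segment embeddings
attention_per_layer = 4 * H * H               # Q, K, V, and output projections (~2.4M)
ffn_per_layer = 2 * H * FF                    # two feed-forward projections (~4.7M)
encoder = L * (attention_per_layer + ffn_per_layer)

print(f"~{(embeddings + encoder) / 1e6:.0f}M parameters")  # ~109M; biases etc. bring it to ~110M
```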

Tokenization

| Algorithm     | Used By          | Merge Criterion    | Special Prefix    |
|---------------|------------------|--------------------|-------------------|
| BPE           | GPT-2, RoBERTa   | Most frequent pair | Space prefix      |
| WordPiece     | BERT, DistilBERT | Maximum likelihood | ## continuation   |
| SentencePiece | T5, ALBERT       | Unigram LM or BPE  | Language-agnostic |

  • Subword tokenization balances vocabulary size, sequence length, and the ability to handle rare/novel words.
  • Tokenizer choice affects downstream performance and must match the pre-trained model.
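
The surface differences are easy to see by tokenizing the same sentence with two pre-trained tokenizers; this sketch assumes the transformers package and the standard bert-base-uncased and gpt2 checkpoints.

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE

text = "Transformers use subword tokenization."
print(bert_tok.tokenize(text))  # WordPiece marks word continuations with '##'
print(gpt2_tok.tokenize(text))  # BPE marks tokens preceded by a space with 'Ġ'
```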

The HuggingFace Ecosystem

  • transformers: Unified API for thousands of pre-trained models (AutoModel, AutoTokenizer, Trainer).
  • tokenizers: Fast Rust-backed tokenizer implementations.
  • datasets: Efficient loading of NLP benchmarks with memory-mapped storage.
  • Auto classes inspect model configs and automatically instantiate the correct architecture.
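
A minimal sketch of the Auto-class workflow (the checkpoint name and num_labels below are chosen purely for illustration): the checkpoint string alone determines which tokenizer and architecture get instantiated.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased"   # swapping this string swaps the whole stack
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)   # torch.Size([2, 2]): untrained head, one logit per label
```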

Fine-Tuning vs. Feature Extraction

| Strategy            | What Is Updated             | When to Use                                   |
|---------------------|-----------------------------|-----------------------------------------------|
| Full fine-tuning    | All model + head parameters | Large labeled dataset, sufficient compute     |
| Feature extraction  | Only the classifier head    | Small dataset, rapid prototyping, multi-task  |
| Partial fine-tuning | Upper layers + head         | Moderate data, domain shift                   |
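
As a sketch of the feature-extraction row: freeze the pre-trained encoder so only the randomly initialized classification head is updated (the `bert` attribute name is specific to BERT-family checkpoints).

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

for param in model.bert.parameters():     # model.bert holds the pre-trained encoder
    param.requires_grad = False           # freeze it; the classifier head stays trainable

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")   # only the small classifier head remains
```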

Fine-Tuning Hyperparameters

  • Learning rate: 2e-5 to 5e-5 (much smaller than typical from-scratch training rates)
  • Epochs: 2--4
  • Warmup: 10% of steps
  • Weight decay: 0.01
  • Critical: Too-high learning rate causes catastrophic forgetting.
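
Expressed through the transformers Trainer API, these settings map onto TrainingArguments roughly as follows; the output directory and batch size are illustrative choices, not part of the recipe above.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-finetuned",
    learning_rate=2e-5,              # small LR to avoid catastrophic forgetting
    num_train_epochs=3,              # 2-4 epochs is usually enough
    warmup_ratio=0.1,                # 10% of steps as learning-rate warmup
    weight_decay=0.01,
    per_device_train_batch_size=16,
)
```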

BERT's Learned Representations

  • Lower layers (1--4): Surface-level features---POS tags, simple syntax.
  • Middle layers (5--8): Syntactic structure---parse trees, dependencies.
  • Upper layers (9--12): Semantic features---coreference, entity types.
  • Individual attention heads specialize in syntactic relations, positional patterns, and coreference.
  • Mean pooling of all tokens often outperforms [CLS] for sentence-level similarity tasks.
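
A short sketch of mean pooling versus [CLS] pooling over BERT's final hidden states (assumes transformers, torch, and the bert-base-uncased checkpoint; the sentences are arbitrary).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["a quick brown fox", "an old gray cat"],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state           # (batch, seq_len, 768)

mask = batch["attention_mask"].unsqueeze(-1)             # zero out padding positions
mean_pooled = (hidden * mask).sum(1) / mask.sum(1)       # (batch, 768) sentence vectors
cls_pooled = hidden[:, 0]                                # the [CLS] vector per sentence
```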

BERT Variants

| Model      | Key Innovation                                                            | Parameters | Trade-off                                            |
|------------|---------------------------------------------------------------------------|------------|------------------------------------------------------|
| RoBERTa    | Removes NSP, dynamic masking, more data, larger batches                    | 125M       | Better performance, same architecture                |
| ALBERT     | Factorized embeddings ($V \times E + E \times H$), cross-layer parameter sharing | 12M        | Fewer params, comparable accuracy, slower inference  |
| DistilBERT | Knowledge distillation (6 layers from 12)                                  | 66M        | 60% size, 97% performance, 60% faster                |
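
The ALBERT row's factorization $V \times E + E \times H$ pays off because $E \ll H$; a rough count with ALBERT-style sizes (approximate values, for illustration only):

```python
V = 30_000          # approximate vocabulary size
H = 768             # hidden size
E = 128             # ALBERT's small embedding dimension

bert_style = V * H                  # full V x H table: ~23.0M parameters
albert_style = V * E + E * H        # factorized version: ~3.9M parameters
print(f"{bert_style / 1e6:.1f}M vs {albert_style / 1e6:.1f}M embedding parameters")
```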

T5: Text-to-Text Transfer Transformer

  • Casts every NLP task as text-to-text: input is a task prefix + text, output is the answer string.
  • Encoder-decoder architecture with relative position biases.
  • Span corruption pre-training: mask contiguous spans and predict them.
  • Examples:
      • Sentiment: "sst2 sentence: great movie" -> "positive"
      • Translation: "translate English to German: Hello" -> "Hallo"
      • Summarization: "summarize: [article text]" -> "[summary]"
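
A sketch of the text-to-text interface using the public t5-small checkpoint and the standard generate API (assumes transformers and sentencepiece are installed; outputs vary with the checkpoint and prompt).

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: Hello",
    "summarize: The quick brown fox jumped over the lazy dog near the river bank.",
]
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)   # same model, task chosen by the prefix
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```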

Practical Insights

  1. Start with the largest model that fits your compute budget, then consider distillation for deployment.
  2. Match the tokenizer to the model---never mix tokenizers from different pre-trained models.
  3. Domain-specific pre-training (further pre-training on domain text before fine-tuning) can significantly improve results for specialized domains.
  4. Gradient accumulation enables effective large batch sizes on limited GPU memory (see the loop sketch after this list).
  5. Learning rate warmup is critical for stable fine-tuning of pre-trained models.
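
Gradient accumulation (insight 4) is easiest to see in a bare training loop. This toy sketch uses a stand-in linear model and random data; with the Trainer API the same effect comes from the gradient_accumulation_steps argument.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)                               # stand-in for a large Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
                    batch_size=4)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4                                 # effective batch size = 4 * 4 = 16
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()             # scale so accumulated grads average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                               # one update per 4 mini-batches
        optimizer.zero_grad()
```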

Common Pitfalls

  1. Using the wrong tokenizer: Each model has a specific tokenizer; mixing them produces garbage inputs.
  2. Too-high learning rate: Destroys pre-trained representations (catastrophic forgetting).
  3. Too many fine-tuning epochs: Overfits to the small downstream dataset.
  4. Ignoring tokenization artifacts: Subword tokenization can split words in unexpected ways that affect token-level tasks like NER (see the alignment sketch after this list).
  5. Comparing models without controlling for tokenization: Different tokenizers produce different sequence lengths, affecting fair comparison.
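
Pitfall 4 is most visible in token-level tasks. The sketch below (the words and NER labels are made up for illustration) uses the fast tokenizer's word_ids() mapping to re-align word-level labels to subword tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
words = ["Angela", "Merkel", "visited", "Ouagadougou"]
labels = ["B-PER", "I-PER", "O", "B-LOC"]

enc = tokenizer(words, is_split_into_words=True)
for token, word_id in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]),
                          enc.word_ids()):
    # word_ids() maps each subword back to its source word (None for special tokens),
    # so rare words split into several '##' pieces all inherit one word-level label.
    print(token, labels[word_id] if word_id is not None else "-")
```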

Looking Ahead

  • Chapter 21: Decoder-only models (GPT family) and autoregressive text generation.
  • Chapter 22: Scaling laws and the path from millions to billions of parameters.
  • Chapter 23: Fine-tuning with human feedback (RLHF) and instruction tuning.