Chapter 20 Key Takeaways

The Big Picture

Pre-training and transfer learning transformed NLP from a field of task-specific architectures into one where a single model, trained on massive unlabeled text, can be adapted to virtually any downstream language task with minimal labeled data. BERT demonstrated that bidirectional Transformer encoders, pre-trained with masked language modeling, produce rich contextual representations that transfer across tasks. The HuggingFace ecosystem made this accessible to every practitioner.


From Static to Contextualized Embeddings

  • Static embeddings (Word2Vec, GloVe) assign one vector per word type, losing context-dependent meaning (the polysemy problem).
  • ELMo introduced contextualized embeddings via bidirectional LSTMs, producing different representations for the same word in different contexts.
  • BERT advanced this with deep bidirectional Transformer encoders, enabling each token to attend to all other tokens simultaneously.
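
To make the contrast concrete, here is a minimal sketch (assuming the transformers and torch packages and the standard bert-base-uncased checkpoint) that extracts the vector for the word "bank" in two sentences and shows the two contextual vectors differ, which a single static embedding cannot capture.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["She sat by the river bank.", "He deposited cash at the bank."]
bank_id = tokenizer.convert_tokens_to_ids("bank")

vectors = []
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
        idx = inputs["input_ids"][0].tolist().index(bank_id) # position of "bank"
        vectors.append(hidden[idx])

cos = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity between the two 'bank' vectors: {cos.item():.2f}")  # noticeably below 1.0
```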

BERT Architecture and Pre-training

Architecture

  • Encoder-only Transformer (12 or 24 layers).
  • Input = token embedding + segment embedding + learned positional embedding.
  • Special tokens: [CLS] for classification, [SEP] for sentence boundaries.
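
A quick way to see the special tokens and segment IDs is to run a sentence pair through the model's tokenizer; this sketch assumes the transformers package and the bert-base-uncased checkpoint.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair inserts [CLS] and [SEP] automatically and emits
# token_type_ids (the segment embedding index: 0 for sentence A, 1 for sentence B).
enc = tokenizer("The cat sat.", "It was tired.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', '.', '[SEP]', 'it', 'was', 'tired', '.', '[SEP]']
print(enc["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```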

Pre-training Objectives

  • Masked Language Modeling (MLM): Randomly mask 15% of tokens (80% [MASK], 10% random, 10% unchanged) and predict them (see the sketch after this list). Enables bidirectional context without information leakage.
  • Next Sentence Prediction (NSP): Binary classification of whether sentence B follows sentence A. Later shown to be unnecessary by RoBERTa.
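
The 80/10/10 rule is easy to misread, so here is a schematic sketch of the corruption step. It is illustrative only, not the original BERT pre-training code; the mask_tokens helper, mask_id argument, and the -100 ignore-label convention follow common PyTorch practice.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (corrupted_ids, labels); labels are -100 where no prediction is required."""
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:        # select ~15% of positions
            labels[i] = tok                    # the model must predict the original token
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                corrupted[i] = mask_id
            elif r < 0.9:                      # 10%: replace with a random token
                corrupted[i] = random.randrange(vocab_size)
            # remaining 10%: keep the token unchanged
    return corrupted, labels
```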

Key Numbers

| Parameter       | BERT-Base | BERT-Large |
|-----------------|-----------|------------|
| Layers          | 12        | 24         |
| Hidden size     | 768       | 1024       |
| Attention heads | 12        | 16         |
| Parameters      | 110M      | 340M       |
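
The 110M figure can be roughly reconstructed from the architecture. The sketch below is a back-of-the-envelope count using the published bert-base-uncased vocabulary size of 30,522, with biases, LayerNorm, and the pooler omitted.

```python
V, P, S = 30_522, 512, 2      # vocab size, max positions, segment types
H, L, FF = 768, 12, 3_072     # hidden size, layers, feed-forward size

embeddings = (V + P + S) * H                  # ~23.8M token/position/segment embeddings
attention_per_layer = 4 * H * H               # Q, K, V, and output projections (~2.4M)
ffn_per_layer = 2 * H * FF                    # two feed-forward projections (~4.7M)
encoder = L * (attention_per_layer + ffn_per_layer)

print(f"~{(embeddings + encoder) / 1e6:.0f}M parameters")  # ~109M; biases etc. bring it to ~110M
```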

Tokenization

| Algorithm     | Used By          | Merge Criterion    | Special Prefix    |
|---------------|------------------|--------------------|-------------------|
| BPE           | GPT-2, RoBERTa   | Most frequent pair | Space prefix      |
| WordPiece     | BERT, DistilBERT | Maximum likelihood | ## continuation   |
| SentencePiece | T5, ALBERT       | Unigram LM or BPE  | Language-agnostic |

  • Subword tokenization balances vocabulary size, sequence length, and the ability to handle rare/novel words.
  • Tokenizer choice affects downstream performance and must match the pre-trained model.
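
The surface differences are easy to see by tokenizing the same sentence with two pre-trained tokenizers; this sketch assumes the transformers package and the standard bert-base-uncased and gpt2 checkpoints.

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE

text = "Transformers use subword tokenization."
print(bert_tok.tokenize(text))  # WordPiece marks word continuations with '##'
print(gpt2_tok.tokenize(text))  # BPE marks tokens preceded by a space with 'Ġ'
```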

The HuggingFace Ecosystem

  • transformers: Unified API for thousands of pre-trained models (AutoModel, AutoTokenizer, Trainer).
  • tokenizers: Fast Rust-backed tokenizer implementations.
  • datasets: Efficient loading of NLP benchmarks with memory-mapped storage.
  • Auto classes inspect model configs and automatically instantiate the correct architecture.
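
A minimal sketch of the Auto-class workflow (the checkpoint name and num_labels below are chosen purely for illustration): the checkpoint string alone determines which tokenizer and architecture get instantiated.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased"   # swapping this string swaps the whole stack
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)   # torch.Size([2, 2]): untrained head, one logit per label
```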

Fine-Tuning vs. Feature Extraction

| Strategy            | What Is Updated             | When to Use                                   |
|---------------------|-----------------------------|-----------------------------------------------|
| Full fine-tuning    | All model + head parameters | Large labeled dataset, sufficient compute     |
| Feature extraction  | Only the classifier head    | Small dataset, rapid prototyping, multi-task  |
| Partial fine-tuning | Upper layers + head         | Moderate data, domain shift                   |
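
As a sketch of the feature-extraction row: freeze the pre-trained encoder so only the randomly initialized classification head is updated (the `bert` attribute name is specific to BERT-family checkpoints).

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

for param in model.bert.parameters():     # model.bert holds the pre-trained encoder
    param.requires_grad = False           # freeze it; the classifier head stays trainable

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")   # only the small classifier head remains
```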

Fine-Tuning Hyperparameters

  • Learning rate: 2e-5 to 5e-5 (much smaller than typical from-scratch training rates)
  • Epochs: 2--4
  • Warmup: 10% of steps
  • Weight decay: 0.01
  • Critical: Too-high learning rate causes catastrophic forgetting.
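
Expressed through the transformers Trainer API, these settings map onto TrainingArguments roughly as follows; the output directory and batch size are illustrative choices, not part of the recipe above.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-finetuned",
    learning_rate=2e-5,              # small LR to avoid catastrophic forgetting
    num_train_epochs=3,              # 2-4 epochs is usually enough
    warmup_ratio=0.1,                # 10% of steps as learning-rate warmup
    weight_decay=0.01,
    per_device_train_batch_size=16,
)
```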

BERT's Learned Representations

  • Lower layers (1--4): Surface-level features---POS tags, simple syntax.
  • Middle layers (5--8): Syntactic structure---parse trees, dependencies.
  • Upper layers (9--12): Semantic features---coreference, entity types.
  • Individual attention heads specialize in syntactic relations, positional patterns, and coreference.
  • Mean pooling of all tokens often outperforms [CLS] for sentence-level similarity tasks.
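
A short sketch of mean pooling versus [CLS] pooling over BERT's final hidden states (assumes transformers, torch, and the bert-base-uncased checkpoint; the sentences are arbitrary).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["a quick brown fox", "an old gray cat"],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state           # (batch, seq_len, 768)

mask = batch["attention_mask"].unsqueeze(-1)             # zero out padding positions
mean_pooled = (hidden * mask).sum(1) / mask.sum(1)       # (batch, 768) sentence vectors
cls_pooled = hidden[:, 0]                                # the [CLS] vector per sentence
```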

BERT Variants

| Model      | Key Innovation                                                            | Parameters | Trade-off                                            |
|------------|---------------------------------------------------------------------------|------------|------------------------------------------------------|
| RoBERTa    | Removes NSP, dynamic masking, more data, larger batches                    | 125M       | Better performance, same architecture                |
| ALBERT     | Factorized embeddings ($V \times E + E \times H$), cross-layer parameter sharing | 12M        | Fewer params, comparable accuracy, slower inference  |
| DistilBERT | Knowledge distillation (6 layers from 12)                                  | 66M        | 60% size, 97% performance, 60% faster                |
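
The ALBERT row's factorization $V \times E + E \times H$ pays off because $E \ll H$; a rough count with ALBERT-style sizes (approximate values, for illustration only):

```python
V = 30_000          # approximate vocabulary size
H = 768             # hidden size
E = 128             # ALBERT's small embedding dimension

bert_style = V * H                  # full V x H table: ~23.0M parameters
albert_style = V * E + E * H        # factorized version: ~3.9M parameters
print(f"{bert_style / 1e6:.1f}M vs {albert_style / 1e6:.1f}M embedding parameters")
```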

T5: Text-to-Text Transfer Transformer

  • Casts every NLP task as text-to-text: input is a task prefix + text, output is the answer string.
  • Encoder-decoder architecture with relative position biases.
  • Span corruption pre-training: mask contiguous spans and predict them.
  • Examples:
      • Sentiment: "sst2 sentence: great movie" -> "positive"
      • Translation: "translate English to German: Hello" -> "Hallo"
      • Summarization: "summarize: [article text]" -> "[summary]"
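
A sketch of the text-to-text interface using the public t5-small checkpoint and the standard generate API (assumes transformers and sentencepiece are installed; outputs vary with the checkpoint and prompt).

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: Hello",
    "summarize: The quick brown fox jumped over the lazy dog near the river bank.",
]
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)   # same model, task chosen by the prefix
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```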

Practical Insights

  1. Start with the largest model that fits your compute budget, then consider distillation for deployment.
  2. Match the tokenizer to the model---never mix tokenizers from different pre-trained models.
  3. Domain-specific pre-training (further pre-training on domain text before fine-tuning) can significantly improve results for specialized domains.
  4. Gradient accumulation enables effective large batch sizes on limited GPU memory (see the loop sketch after this list).
  5. Learning rate warmup is critical for stable fine-tuning of pre-trained models.
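
Gradient accumulation (insight 4) is easiest to see in a bare training loop. This toy sketch uses a stand-in linear model and random data; with the Trainer API the same effect comes from the gradient_accumulation_steps argument.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)                               # stand-in for a large Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
                    batch_size=4)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4                                 # effective batch size = 4 * 4 = 16
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()             # scale so accumulated grads average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                               # one update per 4 mini-batches
        optimizer.zero_grad()
```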

Common Pitfalls

  1. Using the wrong tokenizer: Each model has a specific tokenizer; mixing them produces garbage inputs.
  2. Too-high learning rate: Destroys pre-trained representations (catastrophic forgetting).
  3. Too many fine-tuning epochs: Overfits to the small downstream dataset.
  4. Ignoring tokenization artifacts: Subword tokenization can split words in unexpected ways that affect token-level tasks like NER (see the alignment sketch after this list).
  5. Comparing models without controlling for tokenization: Different tokenizers produce different sequence lengths, affecting fair comparison.
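
Pitfall 4 is most visible in token-level tasks. The sketch below (the words and NER labels are made up for illustration) uses the fast tokenizer's word_ids() mapping to re-align word-level labels to subword tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
words = ["Angela", "Merkel", "visited", "Ouagadougou"]
labels = ["B-PER", "I-PER", "O", "B-LOC"]

enc = tokenizer(words, is_split_into_words=True)
for token, word_id in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]),
                          enc.word_ids()):
    # word_ids() maps each subword back to its source word (None for special tokens),
    # so rare words split into several '##' pieces all inherit one word-level label.
    print(token, labels[word_id] if word_id is not None else "-")
```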

Looking Ahead

  • Chapter 21: Decoder-only models (GPT family) and autoregressive text generation.
  • Chapter 22: Scaling laws and the path from millions to billions of parameters.
  • Chapter 23: Fine-tuning with human feedback (RLHF) and instruction tuning.