> "The most important recent advance in NLP has been the development of pre-trained language representations that can be fine-tuned for a wide range of tasks." --- Jacob Devlin et al., 2019
In This Chapter
- 20.1 From Static to Contextualized Word Embeddings
- 20.2 BERT: Bidirectional Encoder Representations from Transformers
- 20.3 Tokenization: BPE, WordPiece, and SentencePiece
- 20.4 The HuggingFace Ecosystem
- 20.5 Transfer Learning Theory
- 20.6 Fine-Tuning BERT for Classification
- 20.7 Feature Extraction vs. Fine-Tuning
- 20.8 Exploring BERT's Representations
- 20.9 BERT Variants: RoBERTa, ALBERT, and DistilBERT
- 20.10 T5: Text-to-Text Transfer Transformer
- 20.11 Practical Considerations
- 20.12 The Broader Pre-training Landscape
- 20.13 Pre-training Data Curation
- 20.14 Advanced Fine-Tuning: Parameter-Efficient Methods
- 20.15 Connecting to What Comes Next
- Summary
Chapter 20: Pre-training and Transfer Learning for NLP
"The most important recent advance in NLP has been the development of pre-trained language representations that can be fine-tuned for a wide range of tasks." --- Jacob Devlin et al., 2019
In the preceding chapters, we built the Transformer architecture from the ground up (Chapter 18) and explored how attention mechanisms revolutionize sequence modeling (Chapter 19). Those chapters gave us the theoretical and practical foundations for understanding the most impactful development in modern NLP: pre-trained language models and the transfer learning paradigm they enable.
Transfer learning---the practice of leveraging knowledge gained from one task to improve performance on another---has been a cornerstone of computer vision for years. Practitioners routinely fine-tune ImageNet-pretrained models for specialized visual tasks. In NLP, however, the path to effective transfer learning was longer and more winding. Early word embeddings like Word2Vec and GloVe captured useful semantic relationships but were static: the same word always received the same vector regardless of context. It took the development of deep contextual representations---ELMo, and then the Transformer-based BERT---to unlock the full power of transfer learning for language.
This chapter marks a turning point in the book. We move beyond building models from scratch and embrace the HuggingFace ecosystem, the de facto standard library for working with pre-trained Transformer models. By the end of this chapter, you will understand how modern language models are pre-trained, how to apply them to downstream tasks through fine-tuning and feature extraction, and how the family of BERT variants (RoBERTa, ALBERT, DistilBERT) and the T5 text-to-text framework extend these ideas in powerful directions.
20.1 From Static to Contextualized Word Embeddings
20.1.1 Word2Vec and GloVe: A Brief Recap
Before diving into modern pre-training, let us briefly revisit the static embedding methods that laid the groundwork.
Word2Vec (Mikolov et al., 2013) introduced two architectures for learning dense word vectors from large corpora:
- Continuous Bag-of-Words (CBOW): Predicts a target word from its surrounding context words.
- Skip-gram: Predicts surrounding context words from a target word.
Both approaches learn embeddings by optimizing a prediction objective over a large unlabeled corpus. The resulting vectors capture remarkable semantic relationships---the famous example being:
$$\text{vec}(\text{"king"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"}) \approx \text{vec}(\text{"queen"})$$
GloVe (Pennington et al., 2014) took a different approach, directly factorizing the global word-word co-occurrence matrix. The objective minimizes:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where $X_{ij}$ is the co-occurrence count, $f$ is a weighting function that caps the influence of very frequent pairs, and $\mathbf{w}_i$, $\tilde{\mathbf{w}}_j$ are the word and context vectors.
Both Word2Vec and GloVe produce a single fixed vector per word type. This is their fundamental limitation.
20.1.2 The Polysemy Problem
Consider the word "bank" in these sentences:
- "I deposited money at the bank."
- "We had a picnic on the river bank."
Static embeddings assign the same vector to "bank" in both cases. This polysemy problem extends far beyond obvious homonyms. Words like "run," "play," "set," and "light" have dozens of senses, and even words with a single dictionary definition shift meaning subtly based on context. Static embeddings average all these senses into a single point in vector space, losing crucial information.
20.1.3 ELMo: Embeddings from Language Models
ELMo (Embeddings from Language Models; Peters et al., 2018) was the first widely successful approach to contextualized word embeddings. Instead of a lookup table, ELMo generates a different representation for each word based on the entire sentence it appears in.
ELMo uses a two-layer bidirectional LSTM language model. The forward LSTM processes the sentence left-to-right, and the backward LSTM processes it right-to-left. For each token $k$ in a sentence, ELMo computes a task-specific combination of the layer representations:
$$\mathbf{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \mathbf{h}_{k,j}$$
where $\mathbf{h}_{k,j}$ is the representation of token $k$ at layer $j$ (layer 0 being the character-level CNN embedding), $s_j^{task}$ are softmax-normalized learned mixture weights, and $\gamma^{task}$ is a task-specific scaling factor.
Key properties of ELMo:
- Contextual: The same word gets different representations in different sentences.
- Deep: It learns a linear combination of all biLSTM layers, capturing different levels of linguistic information (syntax at lower layers, semantics at higher layers).
- Character-based: The input layer uses character convolutions, providing robustness to out-of-vocabulary words.
ELMo demonstrated that pre-trained contextual representations could dramatically improve performance across NLP tasks. However, it relied on LSTMs, which---as we discussed in Chapter 19---have limitations in capturing long-range dependencies compared to attention-based models.
20.2 BERT: Bidirectional Encoder Representations from Transformers
20.2.1 The Key Insight: Bidirectional Context
The Transformer architecture (Chapter 18) enabled a leap forward. BERT (Devlin et al., 2019) applies the Transformer encoder to language model pre-training with one crucial innovation: it reads text bidirectionally.
Traditional language models are unidirectional---they predict the next word given only previous words (left-to-right) or the previous word given only future words (right-to-left). Even ELMo, despite being "bidirectional," merely concatenates independently trained left-to-right and right-to-left models. BERT trains a single model that can attend to context on both sides simultaneously.
This is made possible by a clever pre-training objective: Masked Language Modeling (MLM).
20.2.2 Architecture
BERT uses only the encoder portion of the Transformer architecture we built in Chapter 18. Recall that the encoder consists of stacked layers, each containing:
- Multi-head self-attention (allowing every token to attend to every other token)
- Position-wise feed-forward network
- Layer normalization and residual connections
BERT comes in two sizes:
| Parameter | BERT-Base | BERT-Large |
|---|---|---|
| Layers ($L$) | 12 | 24 |
| Hidden size ($H$) | 768 | 1024 |
| Attention heads ($A$) | 12 | 16 |
| Total parameters | 110M | 340M |
The input representation sums three embeddings for each token:
$$\mathbf{e}_i = \mathbf{e}_i^{\text{token}} + \mathbf{e}_i^{\text{segment}} + \mathbf{e}_i^{\text{position}}$$
- Token embeddings: Standard embedding lookup for each subword token.
- Segment embeddings: Distinguish between Sentence A and Sentence B (for sentence-pair tasks).
- Position embeddings: Learned absolute position embeddings (unlike the sinusoidal encodings we implemented in Chapter 18).
The input format uses special tokens:
[CLS] Tokens of sentence A [SEP] Tokens of sentence B [SEP]
The [CLS] token's final hidden state serves as a summary representation for classification tasks. The [SEP] token separates the two sentences.
20.2.3 Pre-training Objective 1: Masked Language Modeling (MLM)
To enable bidirectional training without "seeing the answer," BERT randomly masks 15% of input tokens and trains the model to predict them. Specifically, for each selected token:
- 80% of the time, replace with [MASK]
- 10% of the time, replace with a random token
- 10% of the time, keep unchanged
This 80-10-10 strategy is carefully designed to address a subtle problem. If BERT only saw [MASK] tokens during pre-training, it would learn representations that are useful only when [MASK] is present --- but [MASK] never appears during fine-tuning. By sometimes using random tokens and sometimes keeping the original, BERT learns robust representations that work even without the [MASK] signal.
Why 15%? The masking rate is a tradeoff. Too low (e.g., 5%), and the model sees too few prediction targets per batch, making training inefficient. Too high (e.g., 40%), and there is too little context for meaningful predictions --- the task becomes nearly impossible. The 15% rate provides a good balance between training efficiency and context availability. Later work (RoBERTa, SpanBERT) explored different masking strategies and found that the exact rate matters less than other training decisions.
The MLM loss for masked positions $\mathcal{M}$ is:
$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}})$$
where $\mathbf{x}_{\setminus \mathcal{M}}$ denotes all tokens except the masked ones, and $P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}})$ is computed by passing the final hidden state through a linear layer and softmax over the vocabulary.
Worked example. Consider the input: "The cat [MASK] on the mat". The model sees all tokens except the masked one and must predict it. The final hidden state at the [MASK] position is projected through a weight matrix $\mathbf{W} \in \mathbb{R}^{H \times V}$ (where $V$ is the vocabulary size) and a softmax:
$$P(\text{"sat"} \mid \text{context}) = \frac{\exp(\mathbf{w}_{\text{sat}}^\top \mathbf{h}_{[\text{MASK}]})}{\sum_{v \in \mathcal{V}} \exp(\mathbf{w}_v^\top \mathbf{h}_{[\text{MASK}]})}$$
The model is trained to assign high probability to the correct token "sat" using cross-entropy loss. Because attention is bidirectional, the model can use both left context ("The cat") and right context ("on the mat") to make its prediction.
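To see this mechanism in action, the short sketch below queries a pre-trained MLM head for its top predictions at the masked position. It uses the HuggingFace transformers library (introduced in Section 20.4) and assumes the bert-base-uncased checkpoint:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
inputs = tokenizer("The cat [MASK] on the mat.", return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)
top5 = torch.topk(logits[0, mask_pos].softmax(dim=-1), k=5)
for prob, idx in zip(top5.values, top5.indices):
    print(f"{tokenizer.convert_ids_to_tokens(idx.item()):>10s}  p={prob.item():.3f}")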
20.2.4 Pre-training Objective 2: Next Sentence Prediction (NSP)
BERT's second objective trains the model to predict whether two sentences are consecutive in the original text. Given a pair (A, B):
- 50% of the time, B is the actual next sentence (IsNext)
- 50% of the time, B is a random sentence from the corpus (NotNext)
The [CLS] token representation is passed through a classification layer:
$$P(\text{IsNext} \mid A, B) = \text{softmax}(\mathbf{W} \cdot \mathbf{h}_{[\text{CLS}]} + \mathbf{b})$$
The total pre-training loss combines both objectives:
$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$$
The NSP controversy. NSP was intended to help BERT understand relationships between sentences, which is important for tasks like Natural Language Inference (NLI) and Question Answering (QA). However, subsequent research (RoBERTa, Section 20.9.1) found that NSP may actually hurt performance. The problem is that NSP is too easy: randomly sampled negative pairs typically come from different documents and differ in both topic and style, so the model can solve NSP by detecting topic similarity rather than learning discourse structure. ALBERT (Section 20.9.2) replaced NSP with the harder Sentence Order Prediction (SOP), which uses the same two sentences but swaps their order for negative examples.
20.2.5 BERT Architecture Walkthrough
Let us trace the flow of data through BERT step by step to solidify understanding.
Step 1: Tokenization. The input sentence "The cat sat on the mat" is tokenized using WordPiece:
["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
Each token is mapped to an integer ID via the vocabulary lookup table.
Step 2: Embedding. Each token ID is converted to a dense vector by summing three embeddings:
$$\mathbf{e}_i = \mathbf{e}_i^{\text{token}} + \mathbf{e}_i^{\text{segment}} + \mathbf{e}_i^{\text{position}}$$
For BERT-Base, each embedding is 768-dimensional, so the input to the first Transformer layer is a matrix of shape $(8, 768)$ for our 8-token input.
Step 3: Transformer layers. The embedding matrix passes through 12 identical Transformer encoder layers. Each layer applies:
- Multi-head self-attention (as we built in Chapter 18): Each token attends to every other token, producing a context-aware representation. BERT-Base uses 12 attention heads, each operating on 64 dimensions.
- Residual connection + layer normalization: $\mathbf{x}' = \text{LayerNorm}(\mathbf{x} + \text{MultiHeadAttn}(\mathbf{x}))$
- Position-wise feed-forward network: Two linear layers with a GELU activation: $$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$ where $\mathbf{W}_1 \in \mathbb{R}^{768 \times 3072}$ and $\mathbf{W}_2 \in \mathbb{R}^{3072 \times 768}$. The intermediate dimension (3072 = $4 \times 768$) provides additional capacity.
- Another residual connection + layer normalization.
After 12 layers, each token has been transformed through 12 rounds of attention and feed-forward processing, progressively building up from surface-level features to deep semantic representations.
Step 4: Output heads. The final hidden states are used for the pre-training objectives:
- The [CLS] representation feeds into the NSP classification head.
- The masked positions feed into the MLM prediction head (a linear layer + GELU + LayerNorm + linear layer projecting to vocabulary size).
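To make these shapes concrete, the following sketch (assuming the bert-base-uncased checkpoint) runs our example sentence through BERT-Base and inspects the hidden states returned for the embedding layer and each of the 12 encoder layers:
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# 13 tensors: the embedding output plus one per encoder layer, each (1, 8, 768)
print(f"Hidden-state tensors: {len(outputs.hidden_states)}")
print(f"Shape per layer: {tuple(outputs.hidden_states[0].shape)}")
print(f"Final [CLS] vector shape: {tuple(outputs.last_hidden_state[:, 0, :].shape)}")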
20.2.6 Pre-training Data and Compute
BERT was pre-trained on:
- BooksCorpus (800M words): A collection of unpublished books providing diverse long-form text.
- English Wikipedia (2,500M words): Text only, excluding lists, tables, and headers.
Pre-training BERT-Large required 4 days on 16 Cloud TPUs (64 TPU chips)---a significant but now standard computational investment. The total training processed approximately 3.3 billion words over 1 million training steps with a batch size of 256 sequences of 512 tokens.
To put this in perspective: at the time of publication (2019), pre-training BERT-Large cost roughly $7,000--$12,000 in cloud compute. By modern standards, this is extremely modest --- pre-training GPT-4 is estimated to have cost tens of millions of dollars. But BERT demonstrated that meaningful pre-training was achievable for academic and small-industry labs.
20.3 Tokenization: BPE, WordPiece, and SentencePiece
A critical and often overlooked component of modern NLP models is the tokenizer---the algorithm that converts raw text into the subword units the model processes. Unlike traditional word-level tokenization, modern approaches operate at the subword level, balancing vocabulary size with the ability to represent any text.
20.3.1 Why Subword Tokenization?
Word-level tokenization faces two problems:
- Vocabulary explosion: Natural language has an enormous number of word forms (inflections, compounds, neologisms). A word-level vocabulary must be extremely large or suffer from many out-of-vocabulary (OOV) tokens.
- Morphological blindness: "run," "running," "runner," and "runs" are treated as completely unrelated tokens.
Character-level tokenization solves OOV issues but produces very long sequences and loses word-level semantics.
Subword tokenization is the sweet spot: common words are kept whole, while rare words are split into meaningful subword units.
20.3.2 Byte-Pair Encoding (BPE)
BPE (Sennrich et al., 2016) starts with a character-level vocabulary and iteratively merges the most frequent pair of adjacent tokens:
1. Initialize the vocabulary with all individual characters (plus an end-of-word symbol).
2. Count all adjacent token pairs in the corpus.
3. Merge the most frequent pair into a new token.
4. Repeat steps 2-3 for a desired number of merges (determining vocabulary size).
Detailed worked example. Given the corpus with word frequencies:
{"low": 5, "lower": 2, "newest": 6, "widest": 3}
First, we split each word into characters with end-of-word markers:
l o w </w> (frequency: 5)
l o w e r </w> (frequency: 2)
n e w e s t </w> (frequency: 6)
w i d e s t </w> (frequency: 3)
Count all adjacent pairs:
- e s appears $6 + 3 = 9$ times (in "newest" and "widest")
- s t appears $6 + 3 = 9$ times
- l o appears $5 + 2 = 7$ times
- ... and so on.
Merge 1: e + s -> es (frequency 9). Update:
l o w </w> (5)
l o w e r </w> (2)
n e w es t </w> (6)
w i d es t </w> (3)
Merge 2: es + t -> est (frequency 9). Update:
l o w </w> (5)
l o w e r </w> (2)
n e w est </w> (6)
w i d est </w> (3)
Merge 3: est + </w> -> est</w> (frequency 9). Now "est" becomes a word-final token.
Merge 4: l + o -> lo (frequency 7).
And so on. After sufficient merges, common words become single tokens while rare words remain split.
At encoding time, a new word is split greedily using the learned merge rules. For example, "lowest" might be tokenized as ["low", "est"] if those merges were learned.
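The merge-learning loop is simple enough to sketch in a few lines of Python. The following didactic implementation (not a production tokenizer) reproduces the merges from the worked example above:
from collections import Counter
# Word frequencies from the worked example, split into symbols
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}
def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]
def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged
for step in range(4):
    pair, freq = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"Merge {step + 1}: {pair} (frequency {freq})")
# Merge 1: ('e', 's') (frequency 9), Merge 2: ('es', 't') (frequency 9), ...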
BPE is used by GPT-2, GPT-3, and RoBERTa. Byte-level BPE (used by GPT-2) operates on bytes rather than Unicode characters, ensuring that any text can be tokenized without OOV tokens. The vocabulary is initialized with 256 byte tokens, and merges are performed on byte sequences.
Practical tip. The number of BPE merges directly determines the vocabulary size. GPT-2 uses 50,257 tokens. Larger vocabularies mean shorter sequences (good for efficiency) but larger embedding matrices (more parameters). The typical range is 30,000--100,000 tokens.
20.3.3 WordPiece
WordPiece (Schuster and Nakajima, 2012) is similar to BPE but selects merges based on likelihood rather than frequency. Instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data when added to the vocabulary:
$$\text{score}(x, y) = \frac{\text{freq}(xy)}{\text{freq}(x) \times \text{freq}(y)}$$
WordPiece prefixes non-initial subword tokens with ## to indicate continuation:
"embeddings" -> ["embed", "##ding", "##s"]
WordPiece is used by BERT and its variants.
20.3.4 SentencePiece
SentencePiece (Kudo and Richardson, 2018) treats the input as a raw stream of Unicode characters, including whitespace. This makes it language-agnostic---it does not require pre-tokenization or language-specific rules. SentencePiece supports both BPE and unigram language model tokenization.
The unigram approach starts with a large vocabulary and iteratively removes tokens that least reduce the overall likelihood:
$$\mathcal{L} = \sum_{i=1}^{|D|} \log \left( \sum_{\mathbf{x} \in S(x_i)} \prod_{j=1}^{|\mathbf{x}|} p(x_j) \right)$$
where $S(x_i)$ is the set of all possible segmentations of sentence $x_i$.
The unigram approach is fundamentally different from BPE. While BPE is greedy (always merge the most frequent pair), the unigram model is probabilistic: it assigns a probability to each token and finds the most likely segmentation of a given sentence using the Viterbi algorithm. This can produce different segmentations for the same word in different contexts, which provides a form of subword regularization that can improve model robustness.
SentencePiece is used by T5, ALBERT, and many multilingual models. Its language-agnostic design makes it especially valuable for multilingual and cross-lingual models that must handle diverse scripts without language-specific preprocessing.
20.3.5 Training a Custom Tokenizer
In practice, you may need to train a tokenizer on domain-specific data. The HuggingFace tokenizers library makes this straightforward:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize a BPE tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# Train on a text file
trainer = BpeTrainer(
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
vocab_size=30000,
min_frequency=2,
)
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)
# Use the trained tokenizer
output = tokenizer.encode("Custom tokenization for domain data")
print(f"Tokens: {output.tokens}")
print(f"IDs: {output.ids}")
Training a custom tokenizer is recommended when your domain has specialized vocabulary not well-covered by the pre-trained tokenizer (e.g., chemical formulas, code syntax, medical terminology). A poorly matched tokenizer forces the model to represent common domain terms as sequences of subwords, wasting sequence length and making learning harder.
20.3.6 Tokenization in Practice
The choice of tokenizer has meaningful consequences:
| Tokenizer | Models | Special Prefix | Vocabulary Size |
|---|---|---|---|
| BPE | GPT-2, RoBERTa | Ġ (space prefix) | 50,257 (GPT-2) |
| WordPiece | BERT, DistilBERT | ## (continuation) | 30,522 (BERT) |
| SentencePiece | T5, ALBERT | ▁ (space prefix) | 32,000 (T5) |
20.4 The HuggingFace Ecosystem
With this chapter, we introduce the HuggingFace ecosystem---a collection of open-source libraries that has become the standard toolkit for working with pre-trained Transformer models. Throughout the remainder of this book, we will use these libraries alongside PyTorch.
20.4.1 Transformers Library
The transformers library provides:
- Pre-trained models: Thousands of pre-trained models for NLP, vision, audio, and multimodal tasks.
- Unified API: A consistent interface across different model architectures (AutoModel, AutoTokenizer, etc.).
- Training utilities: The Trainer class simplifies fine-tuning with distributed training, mixed precision, and logging.
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Transfer learning is powerful.", return_tensors="pt")
outputs = model(**inputs)
20.4.2 Tokenizers Library
The tokenizers library provides fast, Rust-backed tokenizer implementations. It supports:
- BPE, WordPiece, and Unigram tokenization
- Pre-tokenization, normalization, and post-processing pipelines
- Training custom tokenizers on new data
20.4.3 Datasets Library
The datasets library provides:
- Efficient loading and processing of thousands of NLP datasets
- Memory-mapped Apache Arrow format for handling large datasets without loading everything into RAM
- Built-in support for common benchmarks (GLUE, SuperGLUE, SQuAD)
from datasets import load_dataset
dataset = load_dataset("glue", "sst2")
print(dataset["train"][0])
# {'sentence': 'hide new secretions from the parental units', 'label': 0, 'idx': 0}
20.4.4 The Auto Classes
The Auto classes are a powerful abstraction. Rather than specifying the exact model class, you can use:
- AutoTokenizer: Automatically selects the correct tokenizer
- AutoModel: Loads the base model (encoder outputs)
- AutoModelForSequenceClassification: Loads a model with a classification head
- AutoModelForTokenClassification: Loads a model with a per-token classification head
- AutoModelForQuestionAnswering: Loads a model configured for extractive QA
These classes inspect the model configuration and instantiate the appropriate architecture.
20.5 Transfer Learning Theory
20.5.1 Why Does Pre-training Help?
The success of pre-training is remarkable when you consider the surface-level disconnect: BERT is pre-trained to predict masked words, but it is fine-tuned for tasks like sentiment analysis, named entity recognition, and question answering. Why should predicting masked words teach the model anything about sentiment?
The answer lies in the nature of language understanding. To predict a masked word accurately, the model must understand:
- Syntax: Grammar constrains which words can fill a slot ("The dog [MASK] the ball" requires a verb).
- Semantics: Meaning constrains the choice further ("The dog [MASK] the ball" is likely "chased" or "caught," not "computed").
- World knowledge: Factual knowledge helps ("The capital of France is [MASK]" requires knowing the answer is "Paris").
- Discourse: Context beyond the immediate sentence helps ("She was exhausted. She decided to [MASK]" benefits from understanding causality).
These capabilities are precisely what downstream NLP tasks require. Pre-training on masked language modeling forces the model to develop general linguistic competence, which then transfers to specific tasks.
20.5.2 The PAC-Bayes Perspective on Transfer Learning
A more formal perspective comes from PAC-Bayes theory. The generalization bound for a model with parameters $\boldsymbol{\theta}$ fine-tuned from a pre-trained initialization $\boldsymbol{\theta}_0$ is approximately:
$$\text{Generalization gap} \leq \sqrt{\frac{D_{\text{KL}}(\boldsymbol{\theta} \| \boldsymbol{\theta}_0) + \log(N/\delta)}{2N}}$$
where $N$ is the number of fine-tuning examples, $\delta$ is the confidence parameter, and $D_{\text{KL}}(\boldsymbol{\theta} \| \boldsymbol{\theta}_0)$ measures how far the fine-tuned parameters deviate from the pre-trained initialization.
The key insight is that the bound is tight when $\boldsymbol{\theta}$ stays close to $\boldsymbol{\theta}_0$. Pre-training provides an initialization that is already close to a good solution for many tasks, so fine-tuning only needs small updates. This explains why:
- Small learning rates are used for fine-tuning: Large learning rates would push $\boldsymbol{\theta}$ far from $\boldsymbol{\theta}_0$, increasing the KL term.
- Few fine-tuning epochs are needed: The model starts near a good solution and needs only minor adjustments.
- Pre-training is especially valuable for small datasets: With small $N$, the KL term dominates, and starting close to a good solution is critical.
20.5.3 Domain Shift and Pre-training Data
Transfer learning works best when the pre-training domain is similar to the downstream domain. BERT, pre-trained on Wikipedia and BookCorpus, transfers well to general English NLP tasks. But for specialized domains, the vocabulary, syntax, and factual knowledge may differ significantly.
This has motivated a family of domain-specific pre-trained models:
| Domain | Model | Pre-training Data |
|---|---|---|
| Biomedical | BioBERT, PubMedBERT | PubMed abstracts, PMC full-text articles |
| Scientific | SciBERT | Semantic Scholar papers |
| Clinical | ClinicalBERT | MIMIC-III clinical notes |
| Legal | Legal-BERT | Legal documents and case law |
| Financial | FinBERT | Financial news and SEC filings |
These models either continue pre-training BERT on domain-specific text (domain-adaptive pre-training) or pre-train from scratch on domain-specific corpora. Research consistently shows that domain-adaptive pre-training improves downstream performance on in-domain tasks, sometimes by large margins.
20.6 Fine-Tuning BERT for Classification
20.6.1 The Transfer Learning Pipeline
The standard pipeline for using BERT on a downstream task is:
- Load a pre-trained BERT model and tokenizer.
- Add a task-specific head (typically a linear layer on top of [CLS]).
- Fine-tune the entire model (or just the head) on labeled task data.
- Evaluate on a held-out test set.
This is far more data-efficient than training from scratch. Tasks that previously required hundreds of thousands of labeled examples can now achieve strong performance with just a few thousand.
20.6.2 Single-Sentence Classification
For tasks like sentiment analysis, the input format is:
[CLS] This movie was fantastic! [SEP]
The [CLS] representation is passed through a dropout layer and a linear classifier:
$$\hat{y} = \text{softmax}(\mathbf{W} \cdot \text{Dropout}(\mathbf{h}_{[\text{CLS}]}) + \mathbf{b})$$
The cross-entropy loss is computed against the ground-truth label, and gradients flow through the entire model---from the classification head all the way back through the Transformer layers, updating the pre-trained weights.
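A minimal sketch of this setup with HuggingFace follows (assuming the bert-base-uncased checkpoint and a hypothetical label convention where 1 means positive); when labels are supplied, the library computes the cross-entropy loss internally:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
inputs = tokenizer("This movie was fantastic!", return_tensors="pt")
labels = torch.tensor([1])  # hypothetical convention: 1 = positive
outputs = model(**inputs, labels=labels)
print(f"Logits shape: {tuple(outputs.logits.shape)}")  # (1, 2)
print(f"Cross-entropy loss: {outputs.loss.item():.4f}")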
20.6.3 Sentence-Pair Classification
For tasks like natural language inference (NLI) or paraphrase detection:
[CLS] Sentence A [SEP] Sentence B [SEP]
The same [CLS]-based classification approach applies. The segment embeddings distinguish between the two sentences.
20.6.4 Token Classification (Named Entity Recognition)
BERT can also be fine-tuned for token classification tasks, where each token receives its own label. Named Entity Recognition (NER) is the canonical example: given a sentence, identify which tokens are person names, organizations, locations, etc.
For NER, we use the hidden state of each token (not just [CLS]) and pass it through a classification head:
$$\hat{y}_i = \text{softmax}(\mathbf{W} \cdot \mathbf{h}_i + \mathbf{b}) \quad \text{for each token } i$$
A subtlety arises with WordPiece tokenization: the word "Washington" might be tokenized as ["Wash", "##ing", "##ton"]. Only the first subtoken should receive the NER label; the continuation tokens are typically labeled with a special ignore index (-100 in PyTorch) so they do not contribute to the loss.
import torch
from transformers import (
AutoTokenizer,
AutoModelForTokenClassification,
)
torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
"bert-base-uncased", num_labels=9 # BIO tags for NER
)
text = "Barack Obama was born in Honolulu"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
print(f"Logits shape: {outputs.logits.shape}") # (1, seq_len, 9)
print(f"Predictions: {predictions}")
20.6.5 Extractive Question Answering
For extractive QA (given a question and a context passage, find the answer span in the context), BERT uses two classification heads: one to predict the start position of the answer and one to predict the end position:
$$P(\text{start} = i) = \frac{\exp(\mathbf{w}_s^\top \mathbf{h}_i)}{\sum_j \exp(\mathbf{w}_s^\top \mathbf{h}_j)}, \quad P(\text{end} = i) = \frac{\exp(\mathbf{w}_e^\top \mathbf{h}_i)}{\sum_j \exp(\mathbf{w}_e^\top \mathbf{h}_j)}$$
where $\mathbf{w}_s$ and $\mathbf{w}_e$ are learned weight vectors. The answer is extracted as the span from the highest-scoring start position to the highest-scoring end position (subject to the constraint that end >= start).
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
question = "What is the capital of France?"
context = "Paris is the capital and most populous city of France."
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)
answer_tokens = inputs["input_ids"][0][start_idx : end_idx + 1]
answer = tokenizer.decode(answer_tokens)
print(f"Answer: {answer}")
20.6.6 Fine-Tuning Hyperparameters
The original BERT paper recommends the following hyperparameter ranges for fine-tuning:
| Hyperparameter | Recommended Range |
|---|---|
| Learning rate | 2e-5, 3e-5, 5e-5 |
| Batch size | 16, 32 |
| Epochs | 2, 3, 4 |
| Warmup proportion | 0.1 |
| Weight decay | 0.01 |
Critical insight: Fine-tuning uses a much smaller learning rate than training from scratch. Large learning rates would destroy the pre-trained representations---a phenomenon called catastrophic forgetting. The linear warmup schedule gradually increases the learning rate from 0 to the target value over the first 10% of training steps, which stabilizes training.
20.6.7 Implementation with HuggingFace
Here is the complete fine-tuning workflow:
import torch
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
Trainer,
)
from datasets import load_dataset
import numpy as np
torch.manual_seed(42)
# Load dataset and model
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased", num_labels=2
)
# Tokenize
def tokenize_function(examples):
return tokenizer(
examples["sentence"],
padding="max_length",
truncation=True,
max_length=128,
)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Define metrics
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
accuracy = (predictions == labels).mean()
return {"accuracy": accuracy}
# Training arguments
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
warmup_ratio=0.1,
seed=42,
)
# Train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
compute_metrics=compute_metrics,
)
trainer.train()
20.7 Feature Extraction vs. Fine-Tuning
There are two main strategies for using pre-trained models:
20.7.1 Fine-Tuning (Updating All Parameters)
In fine-tuning, the pre-trained model's weights are updated during training on the downstream task. This allows the representations to adapt to the specific task.
Advantages:
- Generally achieves the best performance
- The model can learn task-specific features

Disadvantages:
- Requires more compute and memory (gradients for all parameters)
- Risk of catastrophic forgetting with too few examples or too many epochs
- A separate copy of the model is needed for each task
20.7.2 Feature Extraction (Freezing the Model)
In feature extraction, the pre-trained model's weights are frozen. The model is used only to generate fixed representations, which are then fed into a separate, lightweight classifier.
import torch
from transformers import AutoTokenizer, AutoModel
torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Freeze all parameters
for param in model.parameters():
param.requires_grad = False
# Get embeddings
inputs = tokenizer("Feature extraction is efficient.", return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
# Use [CLS] token embedding as input to a separate classifier
cls_embedding = outputs.last_hidden_state[:, 0, :] # Shape: (1, 768)
Advantages:
- Much faster---only the classifier is trained
- One model serves multiple tasks
- Lower memory requirements

Disadvantages:
- Typically lower performance than fine-tuning
- The frozen representations may not be optimal for the specific task
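The workflow can be completed with a lightweight classifier trained on the frozen [CLS] vectors. The sketch below uses a hypothetical four-example toy dataset purely for illustration:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()  # frozen feature extractor
texts = ["A wonderful film.", "Utterly boring.", "Loved every minute.", "A waste of time."]
labels = torch.tensor([1, 0, 1, 0])  # hypothetical: 1 = positive, 0 = negative
with torch.no_grad():
    enc = tokenizer(texts, padding=True, return_tensors="pt")
    features = encoder(**enc).last_hidden_state[:, 0, :]  # (4, 768) [CLS] vectors
classifier = nn.Linear(768, 2)  # the only trainable parameters
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(classifier(features), labels)
    loss.backward()
    optimizer.step()
print(f"Final training loss: {loss.item():.4f}")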
20.7.3 Partial Fine-Tuning (Freezing Lower Layers)
A middle ground is partial fine-tuning: freeze the lower Transformer layers and only fine-tune the upper layers plus the task head. This preserves low-level linguistic features while allowing high-level representations to adapt.
import torch
from transformers import AutoModelForSequenceClassification
torch.manual_seed(42)
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased", num_labels=2
)
# Freeze embedding layer and first 8 Transformer layers
for param in model.bert.embeddings.parameters():
param.requires_grad = False
for i in range(8):
for param in model.bert.encoder.layer[i].parameters():
param.requires_grad = False
# Only layers 9-11 and the classifier head will be updated
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")
20.7.4 Layer-wise Learning Rate Decay (Discriminative Fine-tuning)
A more sophisticated approach assigns different learning rates to different layers. The intuition is that lower layers capture general linguistic features that should change little, while upper layers capture task-specific features that need more adaptation.
The learning rate for layer $l$ (out of $L$ total layers) is:
$$\text{lr}_l = \text{lr}_{\text{base}} \cdot \xi^{L - l}$$
where $\xi \in (0, 1)$ is the decay factor (typically $\xi = 0.95$). Layer $L$ (the top layer) gets the base learning rate, layer $L-1$ gets $0.95 \times$ the base rate, and so on. This approach, introduced by Howard and Ruder (2018) in the ULMFiT framework, consistently improves fine-tuning performance.
import torch
from transformers import AutoModelForSequenceClassification
torch.manual_seed(42)
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased", num_labels=2
)
base_lr = 2e-5
decay = 0.95
num_layers = 12
# Create parameter groups with layer-wise learning rates
# (the embedding layer is omitted from the optimizer, so it is never updated)
param_groups = []
for i in range(num_layers):
    lr = base_lr * (decay ** (num_layers - i))
    params = list(model.bert.encoder.layer[i].parameters())
    param_groups.append({"params": params, "lr": lr})
# Classifier head (and the pooler it reads from) gets the base learning rate
param_groups.append({
    "params": list(model.classifier.parameters())
    + list(model.bert.pooler.parameters()),
    "lr": base_lr,
})
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
print(f"Layer 0 lr: {param_groups[0]['lr']:.2e}")
print(f"Layer 11 lr: {param_groups[11]['lr']:.2e}")
print(f"Classifier lr: {param_groups[12]['lr']:.2e}")
20.7.5 When to Use Each Strategy
| Scenario | Recommended Strategy |
|---|---|
| Large labeled dataset, sufficient compute | Full fine-tuning |
| Small labeled dataset (< 1,000 examples) | Feature extraction or partial fine-tuning |
| Multiple tasks, single model | Feature extraction |
| Domain very different from pre-training data | Full fine-tuning |
| Rapid prototyping | Feature extraction |
| Maximum performance with limited data | Layer-wise learning rate decay |
20.8 Exploring BERT's Representations
20.8.1 What Do Different Layers Capture?
Research (Jawahar et al., 2019; Tenney et al., 2019) has shown that BERT's layers encode a hierarchy of linguistic information:
- Lower layers (1--4): Surface-level features---part-of-speech tags, simple syntax.
- Middle layers (5--8): Syntactic structure---parse trees, dependency relations.
- Upper layers (9--12): Semantic features---coreference, entity types, semantic roles.
This hierarchy motivates partial fine-tuning: lower layers capture universal linguistic knowledge that transfers well across tasks, while upper layers capture more task-relevant semantics.
20.8.2 Attention Head Analysis
Individual attention heads in BERT often specialize:
- Some heads track syntactic relations (subject-verb agreement).
- Some heads attend to the next or previous token (positional patterns).
- Some heads in later layers track coreference chains.
This emergent specialization happens purely from the pre-training objectives, without explicit supervision for these linguistic phenomena.
20.8.3 The [CLS] Token
The [CLS] token representation is often used as a "sentence embedding." However, research (Reimers and Gurevych, 2019) has shown that the mean of all token embeddings sometimes produces better sentence representations, especially for tasks like semantic similarity.
For a sequence of $n$ token representations $\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_n$, the mean pooling strategy computes:
$$\mathbf{s} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i$$
Alternatively, max pooling takes the element-wise maximum:
$$\mathbf{s}_j = \max_{i=1}^{n} \mathbf{h}_{i,j} \quad \text{for each dimension } j$$
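In practice, mean pooling should exclude padding tokens, which is easily done with the attention mask. A minimal sketch (assuming the bert-base-uncased checkpoint):
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
sentences = ["The cat sat on the mat.", "A feline rested on the rug."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (2, seq_len, 768)
# Mean pooling over real tokens only (padding positions are masked out)
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (2, seq_len, 1)
sentence_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
sim = torch.nn.functional.cosine_similarity(sentence_emb[0], sentence_emb[1], dim=0)
print(f"Cosine similarity: {sim.item():.3f}")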
20.9 BERT Variants: RoBERTa, ALBERT, and DistilBERT
The success of BERT spawned a family of variants, each addressing different limitations.
20.9.1 RoBERTa: A Robustly Optimized BERT
RoBERTa (Liu et al., 2019) demonstrates that BERT was significantly undertrained and makes several key changes:
- Removes NSP: The Next Sentence Prediction objective is eliminated. RoBERTa shows it does not help and can hurt performance.
- Dynamic masking: Instead of creating masked sequences once during preprocessing, masks are generated dynamically each time a sequence is fed to the model.
- Larger batches: Trained with batch sizes of 8,192 (vs. BERT's 256).
- More data: Trained on 160GB of text (vs. BERT's 16GB), including CC-News, OpenWebText, and Stories.
- Longer training: 500K steps (vs. BERT's 1M steps, but with larger batches, so more total tokens).
- BPE tokenization: Uses byte-level BPE instead of WordPiece.
These seemingly simple changes yield significant improvements. RoBERTa achieves state-of-the-art results on GLUE, SQuAD, and RACE benchmarks, demonstrating that training recipe matters as much as architecture.
The RoBERTa study is one of the most important ablation studies in NLP. It systematically tested each decision in BERT's training pipeline and found that BERT left significant performance on the table. The key lesson for practitioners is: before designing a new architecture, make sure you have fully optimized the training recipe for existing architectures. This principle --- "compute-optimal training" --- has become a central theme in modern LLM development.
Using RoBERTa with HuggingFace:
import torch
from transformers import AutoTokenizer, AutoModel
torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
# RoBERTa uses byte-level BPE (note: different special tokens)
text = "RoBERTa improves on BERT with better training."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# Note: RoBERTa uses <s> and </s> instead of [CLS] and [SEP]
print(f"Token IDs: {inputs['input_ids'][0][:5].tolist()}")
print(f"Output shape: {outputs.last_hidden_state.shape}")
20.9.2 ALBERT: A Lite BERT
ALBERT (Lan et al., 2020) addresses BERT's parameter inefficiency with two techniques:
Factorized Embedding Parameterization. BERT ties the embedding dimension $E$ to the hidden dimension $H$. Since $H$ must be large for effective representations, this makes the embedding matrix ($V \times H$) enormous. ALBERT decouples them by projecting through a smaller dimension:
$$\mathbf{e} \in \mathbb{R}^{V \times E} \quad \rightarrow \quad \mathbf{W}_{\text{proj}} \in \mathbb{R}^{E \times H}$$
With $V = 30{,}000$, $E = 128$, $H = 768$: BERT uses $30{,}000 \times 768 = 23.0\text{M}$ parameters, while ALBERT uses $30{,}000 \times 128 + 128 \times 768 = 3.9\text{M}$.
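The parameter savings are easy to verify with a minimal sketch of the factorization (an illustration of the idea, not ALBERT's actual implementation):
import torch.nn as nn
V, E, H = 30_000, 128, 768
# BERT-style embedding: a single V x H lookup table
bert_embedding = nn.Embedding(V, H)
# ALBERT-style: a V x E lookup table followed by an E x H projection
albert_embedding = nn.Sequential(nn.Embedding(V, E), nn.Linear(E, H, bias=False))
def num_params(module):
    return sum(p.numel() for p in module.parameters())
print(f"BERT-style:   {num_params(bert_embedding):,} parameters")    # 23,040,000
print(f"ALBERT-style: {num_params(albert_embedding):,} parameters")  # 3,938,304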
Cross-Layer Parameter Sharing. All Transformer layers share the same parameters. This dramatically reduces the model size:
| Model | Layers | Hidden | Parameters |
|---|---|---|---|
| BERT-Base | 12 | 768 | 110M |
| ALBERT-Base | 12 | 768 | 12M |
| ALBERT-xxlarge | 12 | 4096 | 235M |
ALBERT also replaces NSP with Sentence Order Prediction (SOP), which is harder and more useful: given two consecutive sentences, predict whether they are in the correct order.
20.9.3 DistilBERT: Distilled BERT
DistilBERT (Sanh et al., 2019) uses knowledge distillation to compress BERT into a smaller, faster model. A smaller "student" model is trained to mimic the outputs of the larger "teacher" model.
The distillation loss combines three components:
$$\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{CE}} + \beta \cdot \mathcal{L}_{\text{distill}} + \gamma \cdot \mathcal{L}_{\text{cos}}$$
- $\mathcal{L}_{\text{CE}}$: Standard cross-entropy with true labels
- $\mathcal{L}_{\text{distill}}$: KL divergence between the student and teacher soft probability distributions (with temperature scaling)
- $\mathcal{L}_{\text{cos}}$: Cosine embedding loss between student and teacher hidden states
The soft probability distribution with temperature $T$ is:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Higher temperatures produce softer distributions, transferring more information about the teacher's internal knowledge (which tokens it considers similar).
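A minimal sketch of the temperature-scaled distillation term (only the $\mathcal{L}_{\text{distill}}$ component, with random logits standing in for real student and teacher outputs):
import torch
import torch.nn.functional as F
torch.manual_seed(42)
def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
student_logits = torch.randn(4, 30522)  # hypothetical batch of 4, BERT vocabulary size
teacher_logits = torch.randn(4, 30522)
print(f"Distillation loss: {distillation_loss(student_logits, teacher_logits):.4f}")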
DistilBERT results:
| Metric | BERT-Base | DistilBERT |
|---|---|---|
| Parameters | 110M | 66M (40% smaller) |
| Inference speed | 1x | 1.6x faster |
| GLUE score | 79.5 | 77.0 (97% retained) |
20.9.4 Comparing the Variants
| Model | Key Innovation | Parameters | Speed | Accuracy |
|---|---|---|---|---|
| BERT-Base | MLM + NSP | 110M | 1.0x | Baseline |
| RoBERTa | Better training | 125M | 1.0x | Higher |
| ALBERT-Base | Shared params | 12M | 0.7x* | Similar |
| DistilBERT | Distillation | 66M | 1.6x | ~97% of BERT |
*ALBERT-Base is slower despite fewer parameters because parameter sharing means the same weights are reused across layers, not that computation is reduced.
20.10 T5: Text-to-Text Transfer Transformer
20.10.1 The Text-to-Text Framework
T5 (Raffel et al., 2020) proposes a radical simplification: every NLP task is cast as a text-to-text problem. Classification, summarization, translation, question answering---all are reformulated as generating text from text.
Examples:
| Task | Input | Target |
|---|---|---|
| Sentiment | sst2 sentence: this movie is great |
positive |
| Translation | translate English to German: Hello |
Hallo |
| Summarization | summarize: [long article] |
[summary] |
| Similarity | stsb sentence1: A cat sits. sentence2: A cat is sitting. |
5.0 |
This unified format means:
- The same model, same loss function, same training procedure works for every task.
- No task-specific heads or architectures are needed.
- New tasks can be defined simply by choosing a new text prefix.
20.10.2 Architecture
T5 uses the full encoder-decoder Transformer architecture (Chapter 18), unlike BERT which uses only the encoder. The encoder processes the input text, and the decoder generates the output text autoregressively.
Key architectural differences from the original Transformer:
- Relative position embeddings: Instead of absolute position embeddings, T5 uses a simplified version of relative position biases that are shared across layers.
- Pre-norm: Layer normalization is applied before (not after) each sublayer.
- No bias terms: Linear layers omit bias for efficiency.
20.10.3 Pre-training Objective: Span Corruption
T5 is pre-trained with a span corruption objective---a generalization of BERT's MLM. Instead of masking individual tokens, T5 masks contiguous spans and replaces them with sentinel tokens:
Input: The <X> sat on the <Y> and purred loudly.
Target: <X> cat <Y> mat
The model learns to generate the missing spans, conditioned on the corrupted input. This is more challenging than single-token prediction and requires understanding larger contexts.
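In the released T5 checkpoints the sentinel tokens are spelled <extra_id_0>, <extra_id_1>, and so on. The sketch below feeds a corrupted input to t5-small and lets it generate the missing spans (the exact completions will vary):
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
torch.manual_seed(42)
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
text = "The <extra_id_0> sat on the <extra_id_1> and purred loudly."
input_ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
# Expected form: "<pad> <extra_id_0> ... <extra_id_1> ... </s>" (spans vary by checkpoint)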
20.10.4 The C4 Dataset
T5 was pre-trained on the Colossal Clean Crawled Corpus (C4)---approximately 750GB of cleaned English text from Common Crawl. The cleaning pipeline removes:
- Pages with fewer than 5 sentences
- Pages containing offensive words
- Duplicate content
- Non-English text
- Pages with code (curly braces as a heuristic)
20.10.5 T5 Sizes
| Variant | Parameters | Encoder Layers | Decoder Layers | Hidden |
|---|---|---|---|---|
| T5-Small | 60M | 6 | 6 | 512 |
| T5-Base | 220M | 12 | 12 | 768 |
| T5-Large | 770M | 24 | 24 | 1024 |
| T5-3B | 3B | 24 | 24 | 1024 |
| T5-11B | 11B | 24 | 24 | 1024 |
20.10.6 Using T5 with HuggingFace
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
torch.manual_seed(42)
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
# Sentiment analysis as text generation
input_text = "sst2 sentence: This movie is absolutely wonderful!"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
with torch.no_grad():
output_ids = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# "positive"
20.10.7 T5 vs. BERT: When to Use Each
The choice between encoder-only (BERT) and encoder-decoder (T5) architectures depends on your task:
Use BERT/RoBERTa when:
- Your task has a fixed output format (classification, token labeling, span extraction).
- You need fast inference (encoder-only models are faster than encoder-decoder for classification).
- Your fine-tuning dataset is small and you want maximum parameter efficiency.

Use T5 when:
- Your task requires generating variable-length text output (summarization, translation, open-ended QA).
- You want to use a single model architecture for multiple different tasks.
- You want to frame new tasks flexibly---just choose a new text prefix.
- You are comfortable with the slightly higher computational cost of an encoder-decoder model.
The T5 paper itself provides one of the most comprehensive empirical studies in NLP, systematically comparing dozens of pre-training objectives, architectures, and training strategies. Reading the T5 paper (Raffel et al., 2020) is highly recommended for anyone serious about understanding the design space of pre-trained language models.
20.11 Practical Considerations
20.11.1 Choosing the Right Pre-trained Model
When selecting a pre-trained model, consider:
- Task type: Encoder-only models (BERT, RoBERTa) excel at classification and extraction tasks. Encoder-decoder models (T5) excel at generation tasks. Decoder-only models (GPT) excel at open-ended generation.
- Computational budget: DistilBERT for production with tight latency requirements; BERT-Base for a good accuracy-speed tradeoff; RoBERTa-Large for maximum performance.
- Domain: If your data is far from the pre-training domain (e.g., biomedical text), consider domain-specific models like BioBERT, SciBERT, or ClinicalBERT.
- Language: For non-English tasks, consider multilingual models (mBERT, XLM-RoBERTa) or language-specific models.
20.11.2 Handling Long Documents
BERT and its variants are limited to 512 tokens. For longer documents:
- Truncation: Simply cut the text at 512 tokens (loses information). Use head truncation for tasks where the beginning is most informative (news articles) or tail truncation for tasks where the end matters (reviews with a concluding sentence). Head+tail truncation (keeping the first 128 and last 384 tokens) often outperforms either alone.
- Chunking: Split the document into overlapping chunks, process each separately, and aggregate predictions. For classification, you can max-pool or average the [CLS] representations across chunks. For token-level tasks, use a stride smaller than the chunk size to ensure every token appears in the "center" of at least one chunk (where representations are most accurate); see the sketch after this list.
- Hierarchical approach: Use BERT to encode chunks, then use another model (an LSTM or another Transformer) to combine chunk representations. This is effective for document classification and summarization.
- Use a long-range model: Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020), and LED support sequences up to 4,096 or 16,384 tokens using the efficient attention mechanisms we discussed in Chapter 18. These models replace standard full attention with a combination of local sliding-window attention and global attention tokens.
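The chunking strategy can be implemented directly with the tokenizer's overflow handling. A minimal sketch, assuming the bert-base-uncased tokenizer and a synthetic long document:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = " ".join(["This is a long document about transfer learning."] * 400)
# Overlapping 512-token chunks; stride = number of tokens shared between chunks
chunks = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)
print(f"Number of chunks: {chunks['input_ids'].shape[0]}")
print(f"Tokens per chunk: {chunks['input_ids'].shape[1]}")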
Practical tip: Before using a specialized long-document model, check whether simple truncation or chunking achieves acceptable performance. For many tasks (sentiment classification, intent detection), the first 512 tokens contain sufficient information, and the engineering complexity of long-range models is not warranted.
20.11.3 Mixed Precision Training
Fine-tuning large models benefits significantly from mixed precision training, which uses 16-bit floating point (FP16) for most operations:
training_args = TrainingArguments(
output_dir="./results",
fp16=True, # Enable mixed precision
per_device_train_batch_size=32, # Can use larger batches
# ...
)
Mixed precision approximately halves memory usage and can double training speed on modern GPUs with tensor cores.
20.11.4 Gradient Accumulation
When GPU memory is limited, gradient accumulation simulates larger batch sizes:
training_args = TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=8, # Effective batch size = 4 * 8 = 32
# ...
)
20.11.5 Learning Rate Scheduling
The standard schedule for fine-tuning pre-trained models is linear warmup followed by linear decay:
$$\text{lr}(t) = \begin{cases} \text{lr}_{\max} \cdot \frac{t}{t_{\text{warmup}}} & \text{if } t < t_{\text{warmup}} \\ \text{lr}_{\max} \cdot \frac{T - t}{T - t_{\text{warmup}}} & \text{otherwise} \end{cases}$$
where $t$ is the current step, $t_{\text{warmup}}$ is the number of warmup steps, and $T$ is the total number of training steps. This schedule is the default in HuggingFace's Trainer.
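The same schedule can be built manually with get_linear_schedule_with_warmup when you are not using the Trainer. A minimal sketch with a stand-in model and a hypothetical step count:
import torch
from transformers import get_linear_schedule_with_warmup
model = torch.nn.Linear(10, 2)  # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
total_steps = 1000              # hypothetical: batches per epoch * number of epochs
warmup_steps = int(0.1 * total_steps)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
# In the training loop, call scheduler.step() after each optimizer.step()
for step in range(5):
    optimizer.step()
    scheduler.step()
print(f"Learning rate after 5 steps: {scheduler.get_last_lr()[0]:.2e}")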
20.12 The Broader Pre-training Landscape
20.12.1 Pre-training Objectives Taxonomy
Different models use different pre-training strategies:
| Objective | Models | Type | Bidirectional? |
|---|---|---|---|
| Masked Language Model (MLM) | BERT, RoBERTa | Encoder | Yes |
| Causal Language Model (CLM) | GPT, GPT-2 | Decoder | No (left-to-right) |
| Span Corruption | T5, BART | Encoder-Decoder | Encoder: Yes |
| Replaced Token Detection | ELECTRA | Encoder | Yes |
| Sentence Order Prediction | ALBERT | Encoder | Yes |
| Denoising Autoencoder | BART | Encoder-Decoder | Encoder: Yes |
20.12.2 ELECTRA: A More Efficient Alternative
ELECTRA (Clark et al., 2020) replaces MLM with a replaced token detection objective. A small generator network (typically 1/4 to 1/3 the size of the main model) produces plausible replacements for masked tokens, and the main model (discriminator) predicts which tokens have been replaced:
$$\mathcal{L}_{\text{ELECTRA}} = -\sum_{i=1}^{n} \left[ y_i \log D(\mathbf{x}, i) + (1-y_i) \log(1 - D(\mathbf{x}, i)) \right]$$
where $y_i = 1$ if token $i$ was replaced and $D(\mathbf{x}, i)$ is the discriminator's prediction.
Why ELECTRA is more efficient. In BERT's MLM, the model only receives a training signal for the 15% of tokens that are masked. The other 85% of tokens are "wasted" --- the model processes them but does not learn from them. ELECTRA's replaced token detection learns from all tokens: at every position, the model must decide whether the token is original or replaced. This 6.7x increase in training signal per sequence makes ELECTRA dramatically more sample-efficient.
In practice, ELECTRA-Small (14M parameters) outperforms GPT (117M parameters) on the GLUE benchmark, and ELECTRA-Large performs comparably to RoBERTa-Large while using less than 1/4 of the compute. For practitioners with limited compute budgets, ELECTRA provides excellent performance at low cost.
The connection to GANs (Chapter 17) is worth noting: ELECTRA's generator-discriminator setup resembles a GAN, but the generator is trained with maximum likelihood (not adversarial loss), and the discriminator performs token-level binary classification rather than generating data. This avoids the training instability of adversarial methods.
20.12.3 BART: Denoising Sequence-to-Sequence
BART (Lewis et al., 2020) generalizes BERT's denoising approach with an encoder-decoder architecture. The encoder processes a corrupted document, and the decoder reconstructs the original. Corruption strategies include:
- Token masking (like BERT)
- Token deletion (model must determine which positions are missing)
- Sentence permutation
- Document rotation (model must find the true start)
- Text infilling (replacing spans with a single [MASK], requiring the model to predict span length)
20.13 Pre-training Data Curation
20.13.1 The Importance of Data Quality
The quality and composition of pre-training data has a profound impact on model performance --- often more so than architectural choices. As the saying in the field goes, "data is the new algorithm."
Key principles for pre-training data curation:
Deduplication. Training data often contains many duplicate or near-duplicate documents (especially when sourced from web crawls). Training on duplicated data causes the model to memorize specific sequences rather than learning general patterns. Lee et al. (2022) showed that deduplication improves perplexity and downstream performance while reducing training time. Common deduplication approaches include exact deduplication (hash-based), MinHash locality-sensitive hashing for near-duplicates, and suffix array-based methods.
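As a minimal illustration, exact deduplication can be sketched with normalized hashing; real pipelines add MinHash or suffix-array methods to catch near-duplicates:
import hashlib
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before hashing."""
    return " ".join(text.lower().split())
def exact_deduplicate(documents):
    """Keep only the first occurrence of each normalized document (hash-based)."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
docs = [
    "Transfer learning is powerful.",
    "Transfer   learning is POWERFUL.",  # identical after normalization
    "Pre-training requires careful data curation.",
]
print(exact_deduplicate(docs))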
Quality filtering. Not all text is equally useful for learning language. Common filtering strategies include:
- Language identification: Remove non-target-language text using fastText or similar classifiers.
- Perplexity filtering: Use a pre-trained language model to score text quality. Very high perplexity text (gibberish, OCR errors) and very low perplexity text (repetitive boilerplate) are removed.
- Heuristic filters: Remove pages with too few sentences, too many special characters, too many duplicate lines, or insufficient natural language (e.g., code-heavy pages).
- Toxicity filtering: Remove content that is offensive, harmful, or inappropriate using toxicity classifiers.
Domain balance. The mixture of domains in the training data affects what the model learns. A model trained predominantly on news text will write like a journalist; one trained on scientific papers will adopt academic style. Careful domain balancing ensures the model develops broad capabilities. The Pile (Gao et al., 2020) is a well-known curated dataset that intentionally combines diverse sources including academic papers, books, code, web text, and dialogue.
20.13.2 Scaling Laws for Pre-training Data
Hoffmann et al. (2022) ("Chinchilla" paper) demonstrated that many large language models were undertrained --- they used too many parameters relative to the amount of training data. The key finding is that model size and training data should scale roughly equally:
$$D_{\text{optimal}} \approx 20 \times N$$
where $N$ is the number of model parameters and $D_{\text{optimal}}$ is the compute-optimal number of training tokens. This means a 1 billion parameter model should be trained on approximately 20 billion tokens, and a 70 billion parameter model should see approximately 1.4 trillion tokens.
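As a quick sanity check on these numbers, the snippet below applies the 20-tokens-per-parameter rule of thumb together with the common estimate of roughly 6 FLOPs per parameter per training token; both constants are approximations used for back-of-the-envelope budgeting, not exact values.

```python
def chinchilla_optimal_tokens(n_params):
    """Compute-optimal token budget under the ~20 tokens-per-parameter
    rule of thumb from Hoffmann et al. (2022)."""
    return 20 * n_params

def approx_training_flops(n_params, n_tokens):
    """Rough estimate: ~6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

# A 1B-parameter model "wants" ~20B tokens; a 70B model ~1.4T tokens.
for n in (1e9, 70e9):
    d = chinchilla_optimal_tokens(n)
    print(f"{n:.0e} params -> {d:.1e} tokens, ~{approx_training_flops(n, d):.1e} FLOPs")
```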
This scaling law has practical implications: rather than training the largest possible model on a fixed dataset, it is often better to train a smaller model on more data. The LLaMA family of models adopted this approach, training relatively smaller models (7B--65B parameters) on 1--1.4 trillion tokens, and achieved performance competitive with much larger models.
20.14 Advanced Fine-Tuning: Parameter-Efficient Methods
20.14.1 The Problem with Full Fine-Tuning
Full fine-tuning updates all model parameters, which has several drawbacks for large models:
- Storage cost: Each task requires a separate copy of the entire model. For a 340M parameter BERT-Large, that is 1.3 GB per task.
- Compute cost: Computing gradients for all parameters is expensive.
- Catastrophic forgetting risk: Updating all parameters can overwrite useful pre-trained knowledge, especially with small datasets.
Parameter-efficient fine-tuning (PEFT) methods address these issues by updating only a small fraction of parameters while keeping most of the model frozen.
20.14.2 LoRA: Low-Rank Adaptation
LoRA (Hu et al., 2022) is the most widely used PEFT method. Instead of updating a weight matrix $\mathbf{W} \in \mathbb{R}^{d \times d}$ directly, LoRA decomposes the update into a low-rank product:
$$\mathbf{W}' = \mathbf{W} + \Delta\mathbf{W} = \mathbf{W} + \mathbf{B}\mathbf{A}$$
where $\mathbf{B} \in \mathbb{R}^{d \times r}$ and $\mathbf{A} \in \mathbb{R}^{r \times d}$ are the trainable low-rank matrices, and $r \ll d$ is the rank (typically 4--64).
Intuition. The hypothesis is that the weight updates needed for fine-tuning lie in a low-dimensional subspace. Instead of updating a $768 \times 768$ matrix (589,824 parameters), we update a $768 \times 8$ and $8 \times 768$ matrix (12,288 parameters) --- a 48x reduction.
The key advantages of LoRA:
- Same inference speed: At inference time, $\mathbf{B}\mathbf{A}$ is merged into $\mathbf{W}$, adding zero latency.
- Tiny storage per task: Only the low-rank matrices need to be saved (typically < 1% of model size).
- Composable: Different LoRA adaptations can be swapped in and out at inference time.
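A minimal LoRA layer is easy to write down. The sketch below wraps a frozen nn.Linear and adds the trainable low-rank path, assuming PyTorch; in practice you would typically use a maintained implementation such as the HuggingFace peft library rather than rolling your own.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: output = base(x) + (x A^T) B^T * (alpha / r).

    The frozen base weight W stays untouched; only A and B are trained.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze W (and bias)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r                          # standard LoRA scaling

    def forward(self, x):
        # Base path uses the frozen pre-trained weight; the low-rank path
        # adds the trainable update B @ A.  B starts at zero, so training
        # begins exactly at the pre-trained behavior.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```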
20.14.3 Adapters and Prefix Tuning
Other PEFT approaches include:
Adapters (Houlsby et al., 2019): Insert small bottleneck layers (down-project, nonlinearity, up-project) between existing Transformer layers. Only the adapter parameters are trained. With adapter dimension 64, this adds about 3.6% parameters to BERT-Base while achieving 97--99% of full fine-tuning performance.
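For comparison with LoRA, here is a minimal bottleneck adapter in the spirit of Houlsby et al.; it is a simplified sketch, not the authors' exact module or placement within the Transformer block.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a
    residual connection.  Only these weights train; the surrounding
    Transformer layer stays frozen."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()
        # Near-zero init of the up-projection keeps the adapter close to
        # the identity function at the start of fine-tuning.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```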
Prefix tuning (Li and Liang, 2021): Prepend trainable "prefix" vectors to the keys and values at each layer. These prefixes act as task-specific context that steers the model's behavior without changing any model weights. This is particularly effective for generation tasks.
Prompt tuning (Lester et al., 2021): A simpler version of prefix tuning that adds trainable "soft prompt" tokens to the input embedding layer only. With T5-XXL (11B parameters), prompt tuning with just 20,000 trainable parameters achieves performance competitive with full fine-tuning.
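Prompt tuning is the simplest of these to sketch: a small matrix of trainable vectors concatenated in front of the frozen input embeddings. The class below is an illustrative toy, with dimensions chosen only to show the scale of the parameter count.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable "soft prompt" vectors prepended to frozen input embeddings.
    With 20 prompt tokens and a 1,024-dimensional embedding, this is only
    about 20K trainable parameters."""
    def __init__(self, n_tokens: int = 20, embed_dim: int = 1024):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds):            # (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
```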
20.15 Connecting to What Comes Next
In this chapter, we have covered the foundational paradigm of modern NLP: pre-train on large unlabeled corpora, then fine-tune or extract features for downstream tasks. We introduced the HuggingFace ecosystem that makes this practical and explored the family of models that emerged from this paradigm.
The models discussed here---BERT, RoBERTa, T5, and their variants---represent the encoder-centric and encoder-decoder branches of the Transformer family tree. In Chapter 21, we will explore the decoder-only branch: autoregressive language models like GPT and the techniques for generating coherent text from them. In Chapter 22, we will examine how these models scale to billions of parameters and the emergent capabilities that arise.
The transfer learning paradigm is perhaps the single most important concept in this book. It transformed NLP from a field where each task required bespoke architectures and massive labeled datasets into one where a pre-trained foundation can be adapted to virtually any task with modest labeled data. As you continue through this book, every model you encounter will build on these foundations.
Summary
This chapter covered the evolution from static word embeddings to modern pre-trained language models:
- Static embeddings (Word2Vec, GloVe) provide fixed word representations that cannot capture context-dependent meaning. They laid the groundwork by demonstrating that unsupervised learning on large corpora produces useful representations.
- ELMo introduced contextualized embeddings using bidirectional LSTMs, demonstrating that deep, context-dependent representations dramatically improve downstream task performance.
- BERT leverages the Transformer encoder with Masked Language Modeling and Next Sentence Prediction to learn deep bidirectional representations. Its architecture walkthrough reveals how 12 layers of self-attention progressively build linguistic understanding.
- Subword tokenization (BPE, WordPiece, SentencePiece) balances vocabulary size with coverage, enabling models to handle any text. Understanding tokenization is essential because it determines how the model "sees" text.
- Transfer learning theory explains why pre-training works: the model develops general linguistic competence that transfers to specific tasks with minimal parameter updates.
- The HuggingFace ecosystem provides the practical tools (Transformers, Tokenizers, Datasets) for working with pre-trained models, making state-of-the-art NLP accessible to practitioners.
- Fine-tuning adapts pre-trained models to downstream tasks by updating all parameters with a small learning rate, while feature extraction freezes the model and trains only a classifier. Advanced strategies include layer-wise learning rate decay and partial fine-tuning.
- RoBERTa shows that better training recipes improve BERT significantly. ALBERT reduces parameters through factorization and sharing. DistilBERT compresses BERT through knowledge distillation.
- T5 unifies all NLP tasks into a text-to-text framework using an encoder-decoder architecture with span corruption pre-training.
- Pre-training data curation --- including deduplication, quality filtering, and domain balance --- is as important as model architecture for achieving strong performance.
- Parameter-efficient fine-tuning methods (LoRA, adapters, prefix tuning) enable adapting large models to new tasks while updating less than 1% of parameters.
The pre-training and transfer learning paradigm has fundamentally transformed NLP, making it possible to achieve strong performance on virtually any language task with modest labeled data and compute. In the next chapter, we will see how decoder-only models like GPT extend these ideas to autoregressive text generation.