Pre-training and transfer learning transformed NLP from a field of task-specific architectures into one where a single model, trained on massive amounts of unlabeled text, can be adapted to virtually any downstream language task with minimal labeled data. BERT demonstrated that bidirectional Transformer encoders, pre-trained with masked language modeling, produce rich contextual representations that transfer across tasks. The HuggingFace ecosystem made this accessible to every practitioner.
From Static to Contextualized Embeddings
Static embeddings (Word2Vec, GloVe) assign one vector per word type, losing context-dependent meaning (the polysemy problem).
ELMo introduced contextualized embeddings via bidirectional LSTMs, producing different representations for the same word in different contexts.
BERT advanced this with deep bidirectional Transformer encoders, enabling each token to attend to all other tokens simultaneously.
Special tokens: [CLS] for classification, [SEP] for sentence boundaries.
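To make the context-dependence concrete, here is a minimal sketch (using the standard bert-base-uncased checkpoint; the two sentences are invented) that shows the tokenizer inserting [CLS]/[SEP] automatically and BERT producing different vectors for the same word "bank":

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited cash at the bank.", "We sat on the river bank."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

# The tokenizer adds the special tokens for us.
tokens = [tokenizer.convert_ids_to_tokens(ids.tolist()) for ids in batch["input_ids"]]
print(tokens[0])  # starts with '[CLS]' and ends with '[SEP]'

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # shape: (2, seq_len, 768)

# Compare the contextual vectors for "bank" in the two sentences.
i0, i1 = tokens[0].index("bank"), tokens[1].index("bank")
sim = torch.cosine_similarity(hidden[0, i0], hidden[1, i1], dim=0).item()
print(f"cosine similarity between the two 'bank' vectors: {sim:.2f}")  # noticeably below 1.0
```

A static embedding would assign both occurrences of "bank" the identical vector, so this similarity would be exactly 1.0.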
Pre-training Objectives
Masked Language Modeling (MLM): Randomly mask 15% of tokens (80% [MASK], 10% random, 10% unchanged) and predict them. Enables bidirectional context without information leakage (a masking sketch follows below).
Next Sentence Prediction (NSP): Binary classification of whether sentence B follows sentence A. Later shown to be unnecessary by RoBERTa.
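A minimal, hand-rolled sketch of the 80/10/10 masking scheme described above (in practice, transformers' DataCollatorForLanguageModeling implements the same logic; the example ids and vocabulary size refer to bert-base-uncased and are illustrative):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, mlm_prob=0.15):
    """Apply BERT-style MLM masking: 80% [MASK], 10% random token, 10% unchanged."""
    labels = input_ids.clone()

    # Select ~15% of positions as prediction targets, never touching special tokens.
    prob = torch.full(labels.shape, mlm_prob)
    is_special = torch.tensor([[t in special_ids for t in row] for row in labels.tolist()])
    prob.masked_fill_(is_special, 0.0)
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100  # non-selected positions are ignored by the loss

    # 80% of selected positions become [MASK].
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% (10% overall) become a random vocabulary token.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, labels.shape, dtype=torch.long)[randomized]

    # The final 10% keep their original token, so the model cannot rely on seeing [MASK].
    return input_ids, labels

# Illustrative usage with BERT's special token ids (101 = [CLS], 102 = [SEP], 103 = [MASK]).
ids = torch.tensor([[101, 7592, 2088, 102]])
masked_ids, labels = mask_tokens(ids.clone(), mask_token_id=103,
                                 vocab_size=30522, special_ids={101, 102})
```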
Key Numbers
| Parameter       | BERT-Base | BERT-Large |
|-----------------|-----------|------------|
| Layers          | 12        | 24         |
| Hidden size     | 768       | 1024       |
| Attention heads | 12        | 16         |
| Parameters      | 110M      | 340M       |
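These figures can be read directly from the published model configurations (a small check using AutoConfig; only config.json is fetched, not the weights):

```python
from transformers import AutoConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = AutoConfig.from_pretrained(name)  # downloads the config only
    print(name, cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)
# bert-base-uncased  12  768  12
# bert-large-uncased 24 1024  16
```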
Tokenization
| Algorithm     | Used By          | Merge Criterion     | Special Prefix                        |
|---------------|------------------|---------------------|---------------------------------------|
| BPE           | GPT-2, RoBERTa   | Most frequent pair  | Space prefix (Ġ)                      |
| WordPiece     | BERT, DistilBERT | Maximum likelihood  | ## continuation                       |
| SentencePiece | T5, ALBERT       | Unigram LM or BPE   | ▁ whitespace marker (language-agnostic) |
Subword tokenization balances vocabulary size, sequence length, and the ability to handle rare/novel words.
Tokenizer choice affects downstream performance and must match the pre-trained model.
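To see both the prefix conventions from the table and why the tokenizer must match the checkpoint, one can tokenize the same word with two different pre-trained tokenizers (standard HF checkpoint names; the exact splits depend on each vocabulary, so the outputs shown are examples):

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE

word = "tokenization"
print(bert_tok.tokenize(word))        # WordPiece marks continuations with "##", e.g. ['token', '##ization']
print(gpt2_tok.tokenize(" " + word))  # GPT-2 marks a leading space with "Ġ", e.g. ['Ġtoken', 'ization']
```

Because the two vocabularies map ids to entirely different strings, feeding one model ids produced by the other tokenizer silently corrupts the input, which is why the tokenizer is always loaded from the same checkpoint as the model.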
The HuggingFace Ecosystem
transformers: Unified API for thousands of pre-trained models (AutoModel, AutoTokenizer, Trainer).
tokenizers: Fast Rust-backed tokenizer implementations.
datasets: Efficient loading of NLP benchmarks with memory-mapped storage.
Auto classes inspect model configs and automatically instantiate the correct architecture.
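A typical loading pattern looks like the sketch below (distilbert-base-uncased is just an example checkpoint; the two-label classification head is randomly initialized until fine-tuned):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # fast Rust-backed tokenizer
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("Transfer learning is remarkably effective.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2); the head is untrained at this point
```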
Fine-Tuning vs. Feature Extraction
| Strategy            | What Is Updated             | When to Use                                  |
|---------------------|-----------------------------|----------------------------------------------|
| Full fine-tuning    | All model + head parameters | Large labeled dataset, sufficient compute    |
| Feature extraction  | Only the classifier head    | Small dataset, rapid prototyping, multi-task |
| Partial fine-tuning | Upper layers + head         | Moderate data, domain shift                  |
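For the feature-extraction row, freezing amounts to turning off gradients on the encoder (a sketch with bert-base-uncased; the `bert` attribute name is specific to BERT-family models):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze every encoder parameter; only the classification head keeps requires_grad=True.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")  # only the small classifier head remains
```

Partial fine-tuning follows the same pattern, except the loop skips the upper blocks (for example, leaving `model.bert.encoder.layer[-4:]` trainable).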
Fine-Tuning Hyperparameters
Learning rate: 2e-5 to 5e-5 (much smaller than when training from scratch; see the example configuration below)
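A hedged example of how such a learning rate plugs into the Trainer API (the output directory, epoch count, batch size, and warmup setting are illustrative defaults, not values prescribed by this section):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-finetune",        # illustrative path
    learning_rate=2e-5,                # within the 2e-5 to 5e-5 range above
    num_train_epochs=3,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.1,                  # short linear warmup, then decay
)
```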