Chapter 20: Exercises

Conceptual Exercises

Exercise 20.1: Static vs. Contextualized Embeddings

Explain why the sentences "The bank approved the loan" and "The river bank was steep" pose a fundamental problem for Word2Vec embeddings but not for BERT. What specific mechanism in BERT enables it to produce different representations for "bank" in these two sentences?

Exercise 20.2: MLM Masking Strategy

BERT's Masked Language Modeling uses an 80/10/10 strategy (80% [MASK], 10% random, 10% unchanged). Explain the purpose of each component. What would happen if BERT always replaced selected tokens with [MASK]?

Exercise 20.3: NSP Critique

RoBERTa removes Next Sentence Prediction and achieves better results. Propose two hypotheses for why NSP might hurt performance. How does ALBERT's Sentence Order Prediction (SOP) address these issues?

Exercise 20.4: Subword Tokenization Comparison

Given the word "unhappiness," show how each tokenizer might segment it:

- (a) Word-level tokenizer with a 50,000-word vocabulary
- (b) Character-level tokenizer
- (c) BPE tokenizer (assume common subwords: "un", "happi", "ness")
- (d) WordPiece tokenizer (assume common subwords: "un", "##happi", "##ness")

Discuss the tradeoffs of each approach in terms of vocabulary size, sequence length, and semantic coverage.

Exercise 20.5: Feature Extraction vs. Fine-Tuning

You have a medical text classification task with only 200 labeled examples. Your pre-trained model is BERT-Base (trained on general-domain text). Should you use feature extraction, full fine-tuning, or partial fine-tuning? Justify your answer and describe what risks each approach carries in this low-data, domain-shift scenario.

Exercise 20.6: Temperature in Knowledge Distillation

In DistilBERT's distillation loss, the temperature parameter $T$ controls the softness of the probability distribution. Given logits $[2.0, 1.0, 0.5]$:

- (a) Compute the softmax probabilities with $T=1$.
- (b) Compute the softmax probabilities with $T=5$.
- (c) Explain why a higher temperature transfers more information from the teacher to the student.
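
To check your hand computation, here is a minimal NumPy sketch (the helper name is ours, not part of the exercise):

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, T=1))   # sharper distribution
print(softmax_with_temperature(logits, T=5))   # softer distribution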

Exercise 20.7: T5 Text-to-Text Formulation

Convert each of the following NLP tasks into T5-style text-to-text format (provide both the input prefix and the expected target):

- (a) Named Entity Recognition for the sentence "Barack Obama visited Paris."
- (b) Paraphrase detection for "The cat sat on the mat" and "A cat was sitting on the mat."
- (c) Text entailment: premise = "All dogs are animals," hypothesis = "My poodle is an animal."
- (d) Grammar correction for "She don't like apples."

Exercise 20.8: Positional Encoding Trade-offs

BERT uses learned absolute position embeddings, while T5 uses relative position biases. Compare these approaches:

- (a) Which approach can generalize to sequence lengths not seen during training?
- (b) Which has more parameters?
- (c) How does each handle the relationship between tokens at positions 5 and 7 vs. tokens at positions 100 and 102?

Exercise 20.9: ALBERT Parameter Efficiency

BERT-Base has a vocabulary of 30,522 tokens, an embedding dimension of 768, and 12 Transformer layers each with approximately 7.1M parameters (in self-attention and FFN).

- (a) Calculate the number of parameters in BERT's embedding matrix.
- (b) Calculate the number of parameters in BERT's 12 Transformer layers.
- (c) For ALBERT with $E=128$ and shared layers, calculate the corresponding parameter counts.
- (d) What is the total parameter reduction factor?

Exercise 20.10: Catastrophic Forgetting

Define catastrophic forgetting in the context of fine-tuning pre-trained models. Propose three strategies (beyond simply using a small learning rate) to mitigate it.


Coding Exercises

Exercise 20.11: Custom Tokenizer Training

Using the HuggingFace tokenizers library, train a BPE tokenizer on the following corpus with a vocabulary size of 50. Print the tokenization of "the lowest number."

corpus = [
    "the lowest number is the best",
    "the newest findings are the best",
    "the widest river flows the lowest",
    "lower numbers are newer than higher ones",
]
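
A minimal starting point using the tokenizers library (whitespace pre-tokenization and the single [UNK] special token are assumptions you may change):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build an empty BPE tokenizer and train it on the corpus above.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=50, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("the lowest number")
print(encoding.tokens)
print(encoding.ids)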

Exercise 20.12: Tokenizer Comparison

Write a script that tokenizes the sentence "Transformers revolutionized natural language processing" using the tokenizers from (a) bert-base-uncased, (b) roberta-base, and (c) t5-small. Print the tokens and token IDs for each. Compare and discuss the differences.
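
One possible skeleton, assuming the transformers library is installed (t5-small additionally requires sentencepiece):

from transformers import AutoTokenizer

sentence = "Transformers revolutionized natural language processing"
for name in ["bert-base-uncased", "roberta-base", "t5-small"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokens = tokenizer.tokenize(sentence)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"{name:20s} tokens: {tokens}")
    print(f"{name:20s} ids:    {ids}")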

Exercise 20.13: Extracting BERT Layer Representations

Write a function that takes a sentence and returns the hidden states from all 12 BERT layers (plus the embedding layer). Compute the cosine similarity of the word "bank" between the following two sentences at each layer:

- "I deposited money at the bank."
- "The river bank was covered in flowers."

Plot or print how the similarity changes across layers.
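
A sketch of one way to do this, assuming bert-base-uncased and that "bank" survives as a single WordPiece token (which it does here):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_vectors(sentence, word):
    """Return the hidden state of `word` at the embedding layer and all 12 layers."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index(word)                              # first occurrence of the word
    return [h[0, idx] for h in outputs.hidden_states]     # 13 vectors of shape (768,)

v1 = layer_vectors("I deposited money at the bank.", "bank")
v2 = layer_vectors("The river bank was covered in flowers.", "bank")
for layer, (a, b) in enumerate(zip(v1, v2)):
    sim = torch.cosine_similarity(a, b, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity = {sim:.3f}")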

Exercise 20.14: Mean Pooling vs. [CLS] Token

Implement both mean pooling and [CLS]-based sentence embeddings using BERT. Compute the cosine similarity between the following pairs:

- "The cat sat on the mat." and "A feline rested on the rug."
- "The cat sat on the mat." and "Stock prices rose sharply today."

Compare which strategy better captures semantic similarity.
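
A hedged sketch of both pooling strategies (the helper name and its interface are ours):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence, strategy="mean"):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
    if strategy == "cls":
        return hidden[0, 0]                               # [CLS] is the first position
    mask = inputs["attention_mask"][0].unsqueeze(-1)      # ignore padding (none here)
    return (hidden[0] * mask).sum(dim=0) / mask.sum()

pairs = [
    ("The cat sat on the mat.", "A feline rested on the rug."),
    ("The cat sat on the mat.", "Stock prices rose sharply today."),
]
for s1, s2 in pairs:
    for strategy in ("cls", "mean"):
        sim = torch.cosine_similarity(embed(s1, strategy), embed(s2, strategy), dim=0)
        print(f"{strategy:4s} {sim.item():.3f}  {s1!r} vs. {s2!r}")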

Exercise 20.15: Masked Language Model Prediction

Write a function that takes a sentence with a [MASK] token and returns the top-5 predicted tokens with their probabilities using bert-base-uncased. Test it with:

- "The capital of France is [MASK]."
- "She is a brilliant [MASK]."
- "The [MASK] barked loudly at the mailman."
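
One possible implementation outline:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def top_k_predictions(sentence, k=5):
    """Return the k most likely fillers for the [MASK] token and their probabilities."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    tokens = tokenizer.convert_ids_to_tokens(top.indices.tolist())
    return list(zip(tokens, top.values.tolist()))

print(top_k_predictions("The capital of France is [MASK]."))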

Exercise 20.16: Fine-Tuning with Frozen Layers

Modify the fine-tuning example from Section 20.5 to freeze the first 6 layers of BERT. Compare the number of trainable parameters with the full fine-tuning setup. Use the SST-2 dataset and train for 1 epoch with each configuration. Report the validation accuracy.
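
A sketch of the freezing step (whether to also freeze the embedding layer is a choice; here it is frozen):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the embeddings and the first 6 encoder layers.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:6]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")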

Exercise 20.17: Attention Visualization

Write code to extract and visualize the attention weights from a specific layer and head of BERT for the sentence "The cat that the dog chased ran away." Which attention head best captures the subject-verb relationship between "cat" and "ran"?
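
A sketch for extracting the weights (the layer and head indices below are arbitrary placeholders to inspect, not an answer to the question):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The cat that the dog chased ran away."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions      # tuple of 12 tensors: (1, heads, seq, seq)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 8, 10                              # placeholder indices; try them all
weights = attentions[layer][0, head]             # (seq_len, seq_len) attention matrix
ran, cat = tokens.index("ran"), tokens.index("cat")
print(f"attention from 'ran' to 'cat' at layer {layer}, head {head}: "
      f"{weights[ran, cat].item():.3f}")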

Exercise 20.18: Custom Classification Head

Instead of using AutoModelForSequenceClassification, manually implement a classification head on top of BERT. Your head should:

- Take the [CLS] token representation
- Apply dropout (p=0.3)
- Pass through a hidden linear layer (768 -> 256) with ReLU activation
- Apply dropout (p=0.3) again
- Pass through the output linear layer (256 -> num_classes)

Compare this with the default single-linear-layer head on SST-2.
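
A minimal sketch of such a module, assuming PyTorch and the transformers library (the class name is ours):

import torch.nn as nn
from transformers import AutoModel

class BertWithCustomHead(nn.Module):
    """BERT encoder followed by the classification head described above."""

    def __init__(self, num_classes, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.head = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls = outputs.last_hidden_state[:, 0]    # [CLS] representation
        return self.head(cls)                    # logits of shape (batch, num_classes)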

Exercise 20.19: T5 Multi-task Evaluation

Write a script that uses T5-small to perform three different tasks on the same input sentence "The movie was absolutely terrible and I loved every minute of it.":

- (a) Sentiment analysis (use prefix "sst2 sentence:")
- (b) Summarization (use prefix "summarize:")
- (c) Grammar check (use prefix "cola sentence:")
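
One possible skeleton (generation settings such as max_new_tokens are assumptions):

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

sentence = "The movie was absolutely terrible and I loved every minute of it."
for prefix in ["sst2 sentence: ", "summarize: ", "cola sentence: "]:
    inputs = tokenizer(prefix + sentence, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    print(prefix.strip(), "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))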

Exercise 20.20: Subword Tokenization from Scratch

Implement BPE tokenization from scratch (without using the tokenizers library). Your implementation should:

- Take a corpus and a desired vocabulary size
- Perform iterative merging of the most frequent byte pairs
- Return the learned merge rules and vocabulary
- Tokenize new text using the learned rules

Test on a small corpus of at least 5 sentences.
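
As a starting point, here is a hedged sketch of the two core helpers, assuming each word is stored as a tuple of symbols mapped to its frequency:

from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs over a vocab of {tuple_of_symbols: frequency}."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged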


Mathematical Exercises

Exercise 20.21: MLM Loss Computation

Given a sentence of 20 tokens where 3 tokens are masked, and the model produces the following log-probabilities for the correct tokens at the masked positions: $\log P(x_1) = -0.5$, $\log P(x_2) = -1.2$, $\log P(x_3) = -0.3$:

- (a) Compute the MLM loss.
- (b) If the vocabulary size is 30,522, what is the expected loss for a random predictor?
- (c) What is the perplexity of the model on these masked tokens?
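
As a reminder of the quantities involved (with $M$ masked positions and vocabulary size $V$; the notation here is ours, and the random predictor is taken to be uniform over the vocabulary):

$$\mathcal{L}_{\text{MLM}} = -\frac{1}{M} \sum_{i=1}^{M} \log P(x_i), \qquad \mathcal{L}_{\text{uniform}} = \log V, \qquad \text{PPL} = \exp\left(\mathcal{L}_{\text{MLM}}\right).$$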

Exercise 20.22: Parameter Count Analysis

For a Transformer encoder layer with hidden dimension $H$, feed-forward dimension $4H$, and $A$ attention heads:

- (a) Derive the exact parameter count for one layer (include all weight matrices, biases, and layer norms).
- (b) Calculate this for BERT-Base ($H=768$, $A=12$).
- (c) Calculate this for BERT-Large ($H=1024$, $A=16$).

Exercise 20.23: BPE Merge Computation

Given the following word frequencies: {"hug": 10, "pug": 5, "hugs": 12, "bugs": 4, "pugs": 2}:

- (a) Write the initial character-level representation with end-of-word markers.
- (b) Perform the first 3 BPE merges, showing the pair frequencies at each step.
- (c) Show the vocabulary after each merge.

Exercise 20.24: Distillation Loss Derivation

Given teacher logits $\mathbf{z}^T = [3.0, 1.0, 0.2]$ and student logits $\mathbf{z}^S = [2.5, 1.5, 0.5]$ with temperature $T = 2$:

- (a) Compute the soft probability distributions $P^T$ and $P^S$.
- (b) Compute the KL divergence $D_{\text{KL}}(P^T \| P^S)$.
- (c) Compute the distillation loss (which is $T^2 \cdot D_{\text{KL}}$). Explain why $T^2$ scaling is needed.
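
The definitions needed for the computation are

$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad D_{\text{KL}}(P^T \,\|\, P^S) = \sum_i P^T_i \log \frac{P^T_i}{P^S_i}, \qquad \mathcal{L}_{\text{distill}} = T^2 \cdot D_{\text{KL}}(P^T \,\|\, P^S).$$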

Exercise 20.25: ALBERT Embedding Factorization

Prove that the factorized embedding approach in ALBERT saves parameters when $E < \frac{V \cdot H}{V + H}$. For $V = 30{,}000$ and $H = 768$, what is the maximum value of $E$ that still reduces parameters?
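
The two parameter counts to compare are the standard embedding matrix versus the factorized pair of matrices:

$$\underbrace{V \cdot H}_{\text{standard}} \quad \text{vs.} \quad \underbrace{V \cdot E + E \cdot H}_{\text{factorized}},$$

so the proof amounts to determining when $V \cdot E + E \cdot H < V \cdot H$.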


Applied Exercises

Exercise 20.26: Domain-Specific Pre-training

Design an experiment to evaluate whether further pre-training BERT on a domain-specific corpus (e.g., legal or biomedical text) before fine-tuning improves performance on domain-specific tasks. Specify:

- The pre-training corpus and procedure
- The downstream evaluation task(s)
- The baselines you would compare against
- The metrics you would report

Exercise 20.27: Data Augmentation for Fine-Tuning

Propose three text data augmentation strategies that could improve fine-tuning performance when labeled data is scarce. For each strategy:

- Describe the augmentation technique
- Explain why it might help
- Describe a potential failure mode
- Write pseudocode for the implementation

Exercise 20.28: Model Selection Pipeline

You need to deploy a text classification system with the following constraints: latency < 20ms per example, minimum accuracy of 90% on your benchmark, and maximum model size of 100MB. Design a pipeline to evaluate and select from BERT-Base, DistilBERT, ALBERT-Base, and a fine-tuned T5-Small. What metrics would you track beyond accuracy?

Exercise 20.29: Multi-Task Fine-Tuning

Design a training procedure that fine-tunes a single BERT model on three tasks simultaneously: sentiment analysis (SST-2), natural language inference (MNLI), and paraphrase detection (QQP). Address:

- How you would structure the input for each task
- How you would handle different output formats
- Your sampling strategy across tasks
- How you would prevent one task from dominating

Exercise 20.30: Tokenizer Impact Analysis

Write an experiment that measures how tokenizer choice affects downstream task performance. Fine-tune the same BERT architecture (randomly initialized) with three different tokenizers (BPE with vocab 10K, 30K, and 50K) on a text classification task. Report:

- Average tokens per example for each vocabulary size
- Training time per epoch
- Final accuracy
- Analysis of which types of errors change with vocabulary size

Exercise 20.31: Probing Classifiers

Implement a set of probing classifiers to test what linguistic knowledge BERT encodes at different layers. Train simple linear classifiers on frozen BERT representations for:

- Part-of-speech tagging
- Named entity recognition
- Dependency arc labeling

At which layer does each task achieve peak performance?

Exercise 20.32: Cross-Lingual Transfer

Using multilingual BERT (mBERT), design an experiment for zero-shot cross-lingual transfer: fine-tune on English sentiment data and evaluate on French, German, and Spanish sentiment data (without any training data in those languages). Report accuracy for each language and discuss what factors affect cross-lingual transfer quality.

Exercise 20.33: Efficient Fine-Tuning Comparison

Compare three fine-tuning strategies on the same dataset and compute budget:

- (a) Full fine-tuning of BERT-Base
- (b) Feature extraction with a 2-layer MLP
- (c) Partial fine-tuning (freeze first 9 layers)

Report: accuracy, training time, GPU memory usage, and number of trainable parameters.

Exercise 20.34: Span Corruption Implementation

Implement T5's span corruption pre-training objective from scratch. Given a tokenized sentence, your function should:

- Select spans to corrupt (with mean span length 3 and corruption rate 15%)
- Replace each span with a unique sentinel token
- Generate the corresponding target sequence
- Return both the corrupted input and the target
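
A deliberately simplified sketch to illustrate the input/target format (the span-placement heuristic and length sampling below are our simplifications, not T5's exact procedure):

import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_length=3, seed=0):
    """Corrupt roughly `corruption_rate` of tokens in spans of about `mean_span_length`."""
    rng = random.Random(seed)
    budget = max(1, round(len(tokens) * corruption_rate))   # tokens left to corrupt
    corrupted, target = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if budget > 0 and rng.random() < corruption_rate / mean_span_length:
            span_len = min(max(1, round(rng.gauss(mean_span_length, 1))), budget)
            corrupted.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            target.extend(tokens[i:i + span_len])
            sentinel += 1
            budget -= span_len
            i += span_len
        else:
            corrupted.append(tokens[i])
            i += 1
    target.append(f"<extra_id_{sentinel}>")                 # final sentinel closes the target
    return corrupted, target

inp, tgt = span_corrupt("thank you for inviting me to your party last week".split())
print(inp)
print(tgt)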

Exercise 20.35: ELECTRA-Style Pre-training

Implement a simplified version of ELECTRA's replaced token detection:

- Train a small masked language model as the generator
- Use the generator to produce replacements for masked tokens
- Train a discriminator to classify each token as original or replaced

Compare the training efficiency (loss convergence per step) with standard MLM on a small corpus.