Chapter 20: Quiz

Test your understanding of pre-training and transfer learning for NLP. Each question has one correct answer unless stated otherwise.


Question 1

What is the primary limitation of static word embeddings like Word2Vec and GloVe?

  • (a) They cannot handle out-of-vocabulary words
  • (b) They assign the same vector to a word regardless of its context
  • (c) They require labeled data for training
  • (d) They only work for English text
**Answer: (b) They assign the same vector to a word regardless of its context.** Static embeddings produce a single fixed vector per word type. The word "bank" receives the same representation whether it refers to a financial institution or a river bank. This polysemy problem is the fundamental motivation for contextualized embeddings like ELMo and BERT.

Question 2

How does ELMo generate contextualized word representations?

  • (a) Using a Transformer encoder with masked language modeling
  • (b) By computing a task-specific weighted combination of bidirectional LSTM layer outputs
  • (c) By concatenating Word2Vec and GloVe embeddings
  • (d) Using attention over a fixed vocabulary of word senses
**Answer: (b) By computing a task-specific weighted combination of bidirectional LSTM layer outputs.** ELMo runs a forward and a backward LSTM over the input and computes $\mathbf{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \mathbf{h}_{k,j}$, where the mixture weights $s_j^{task}$ are learned for each downstream task.

Question 3

What pre-training objective enables BERT to use bidirectional context, unlike traditional language models?

  • (a) Next Sentence Prediction (NSP)
  • (b) Causal Language Modeling (CLM)
  • (c) Masked Language Modeling (MLM)
  • (d) Span Corruption
**Answer: (c) Masked Language Modeling (MLM).** Traditional language models must be unidirectional: in a deep bidirectional model, next-token prediction becomes trivial because each token can indirectly "see" itself through the context. MLM sidesteps this by masking random tokens and predicting them from the surrounding (non-masked) context on both sides.

Question 4

In BERT's MLM, what is the 80/10/10 masking strategy?

  • (a) 80% of tokens are masked, 10% are used for NSP, 10% are discarded
  • (b) 80% of selected tokens are replaced with [MASK], 10% with a random token, 10% are kept unchanged
  • (c) 80% of training uses MLM, 10% uses NSP, 10% uses neither
  • (d) 80% of layers use masking, 10% use dropout, 10% use neither
**Answer: (b) 80% of selected tokens are replaced with [MASK], 10% with a random token, 10% are kept unchanged.** This strategy mitigates the distribution mismatch between pre-training (where [MASK] tokens appear) and fine-tuning (where they do not). By sometimes keeping the original token or using a random replacement, the model learns not to rely on the [MASK] token itself.
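
A minimal sketch of this selection logic, assuming PyTorch tensors of token ids (in practice, HuggingFace's `DataCollatorForLanguageModeling` handles this for you, including special-token handling omitted here):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Illustrative BERT-style 80/10/10 masking over a batch of token ids."""
    labels = input_ids.clone()
    # Choose ~15% of positions as prediction targets
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # positions not selected are ignored by the loss

    # 80% of selected positions -> [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # 10% of selected positions -> a random token (half of the remaining 20%)
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    # The final 10% keep their original token but are still predicted
    return input_ids, labels
```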

Question 5

Which component of BERT's input representation distinguishes between two sentences in a sentence-pair task?

  • (a) Token embeddings
  • (b) Position embeddings
  • (c) Segment embeddings
  • (d) Attention mask
**Answer: (c) Segment embeddings.** BERT's input is the sum of token embeddings, position embeddings, and segment embeddings. The segment embeddings assign different vectors to tokens from Sentence A vs. Sentence B, allowing the model to distinguish which sentence each token belongs to.
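
With HuggingFace tokenizers, the segment ids surface as `token_type_ids`; a quick check (using `bert-base-uncased` purely as an example checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The bank raised rates.", "Customers were not pleased.")
# 0 marks [CLS] + Sentence A + [SEP]; 1 marks Sentence B + its trailing [SEP]
print(enc["token_type_ids"])
```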

Question 6

What does the [CLS] token in BERT represent after processing?

  • (a) The embedding of the first word in the sentence
  • (b) A learned separator between sentences
  • (c) An aggregate representation used for classification tasks
  • (d) A padding token that is ignored during training
**Answer: (c) An aggregate representation used for classification tasks.** The [CLS] token is prepended to every input. Through self-attention, it attends to all other tokens and accumulates information from the entire sequence. Its final hidden state is used as input to classification heads during fine-tuning.
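
A small sketch of pulling that vector out with the HuggingFace API (the checkpoint name is just an example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transfer learning works well.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0, :]  # final hidden state of [CLS], shape (1, 768)
```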

Question 7

Which subword tokenization method does BERT use?

  • (a) Byte-Pair Encoding (BPE)
  • (b) WordPiece
  • (c) SentencePiece
  • (d) Character-level tokenization
**Answer: (b) WordPiece.** BERT uses WordPiece tokenization, which selects merges based on likelihood maximization rather than raw frequency. WordPiece uses the `##` prefix to indicate continuation subwords (e.g., "embeddings" becomes ["embed", "##ding", "##s"]).
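
You can see the `##` continuation pieces directly from a BERT tokenizer (the exact split depends on the checkpoint's vocabulary, so treat the output as illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))  # subword pieces; continuations carry the "##" prefix
```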

Question 8

How does Byte-Pair Encoding (BPE) determine which token pairs to merge?

  • (a) By maximizing the likelihood of the training data
  • (b) By merging the most frequent adjacent pair of tokens
  • (c) By selecting the pair that minimizes perplexity
  • (d) By randomly selecting pairs weighted by frequency
**Answer: (b) By merging the most frequent adjacent pair of tokens.** BPE iteratively counts all adjacent token pairs and merges the most frequent one into a new token. WordPiece, by contrast, selects merges based on likelihood rather than raw frequency.
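
A bare-bones sketch of one BPE merge step, assuming each word starts as a list of characters (vocabulary bookkeeping and end-of-word markers are omitted):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent token pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for word in corpus:
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(corpus, pair):
    """Replace every occurrence of the chosen pair with a single merged token."""
    merged_corpus = []
    for word in corpus:
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged_corpus.append(new_word)
    return merged_corpus

corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)   # e.g. ('l', 'o'), which occurs in all three words
corpus = merge_pair(corpus, pair)   # 'l' + 'o' -> 'lo'; repeat until the vocabulary budget is reached
```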

Question 9

What is the key advantage of SentencePiece over BPE and WordPiece?

  • (a) It produces smaller vocabularies
  • (b) It is faster during inference
  • (c) It treats input as raw Unicode and does not require language-specific pre-tokenization
  • (d) It always produces the shortest possible token sequences
**Answer: (c) It treats input as raw Unicode and does not require language-specific pre-tokenization.** SentencePiece processes raw character streams, including whitespace, making it language-agnostic. Traditional BPE and WordPiece typically require a pre-tokenization step (e.g., splitting on whitespace) that embeds language-specific assumptions.

Question 10

What is the recommended learning rate range for fine-tuning BERT?

  • (a) 1e-2 to 1e-1
  • (b) 1e-4 to 1e-3
  • (c) 2e-5 to 5e-5
  • (d) 1e-7 to 1e-6
**Answer: (c) 2e-5 to 5e-5.** Fine-tuning uses a much smaller learning rate than training from scratch. Large learning rates would destroy the pre-trained representations (catastrophic forgetting). The original BERT paper recommends trying 2e-5, 3e-5, and 5e-5.

Question 11

Which of the following changes did RoBERTa make to improve upon BERT? (Select all that apply.)

  • (a) Removed Next Sentence Prediction
  • (b) Used a Transformer decoder instead of encoder
  • (c) Trained with larger batches and more data
  • (d) Used dynamic masking instead of static masking
**Answer: (a), (c), and (d).** RoBERTa removed NSP, used dynamic masking (generating masks on the fly rather than once during preprocessing), trained with much larger batches (8,192 vs. 256), and trained on significantly more data (160GB vs. 16GB). It did not change the architecture: it still uses the Transformer encoder.

Question 12

How does ALBERT reduce the number of parameters compared to BERT?

  • (a) By using fewer attention heads
  • (b) By factorizing the embedding matrix and sharing parameters across layers
  • (c) By using knowledge distillation from a teacher model
  • (d) By reducing the vocabulary size
**Answer: (b) By factorizing the embedding matrix and sharing parameters across layers.** ALBERT introduces two parameter-reduction techniques: (1) factorized embedding parameterization, which first maps tokens to a small embedding dimension E and then projects up to the hidden dimension H, and (2) cross-layer parameter sharing, where all Transformer layers use the same weights.
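
The savings from the factorized embedding alone are easy to quantify; with illustrative numbers (a 30k vocabulary, hidden size 768, embedding size 128):

```python
V, H, E = 30_000, 768, 128        # vocab size, hidden size, embedding size (illustrative values)
bert_style = V * H                # 23,040,000 parameters in a direct V x H embedding matrix
albert_style = V * E + E * H      # 3,938,304 parameters: a V x E lookup plus an E x H projection
print(bert_style, albert_style)   # roughly a 6x reduction in embedding parameters
```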

Question 13

In knowledge distillation (as used in DistilBERT), what does the "temperature" parameter control?

  • (a) The learning rate decay schedule
  • (b) The softness of the probability distribution used for training
  • (c) The number of layers in the student model
  • (d) The proportion of data used for distillation vs. standard training
**Answer: (b) The softness of the probability distribution used for training.** A higher temperature produces a softer (more uniform) probability distribution from the teacher's logits, which transfers more information about the relationships between classes (e.g., which incorrect answers are "almost right"). At $T=1$, this is the standard softmax; as $T \to \infty$, the distribution becomes uniform.
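
The effect is easy to see numerically; a small sketch with made-up teacher logits:

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([4.0, 2.0, 0.5])   # hypothetical logits for three classes

for T in (1.0, 4.0):
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    print(T, soft_targets)
# Higher T flattens the distribution, exposing more of the teacher's knowledge about how the
# non-top classes relate to one another; T = 1 recovers the standard softmax.
```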

Question 14

What percentage of BERT's performance does DistilBERT retain while being 40% smaller?

  • (a) About 80%
  • (b) About 90%
  • (c) About 97%
  • (d) About 99%
**Answer: (c) About 97%.** DistilBERT retains approximately 97% of BERT's language understanding capabilities (as measured on the GLUE benchmark) while being 40% smaller and 60% faster. This makes it attractive for deployment scenarios with latency or resource constraints.

Question 15

How does T5 differ from BERT in its approach to NLP tasks?

  • (a) T5 uses only the decoder, while BERT uses only the encoder
  • (b) T5 casts all tasks as text-to-text problems using an encoder-decoder architecture
  • (c) T5 does not use pre-training
  • (d) T5 uses word-level tokenization instead of subword tokenization
**Answer: (b) T5 casts all tasks as text-to-text problems using an encoder-decoder architecture.** T5 reformulates every NLP task as generating output text from input text. Classification, translation, summarization, and more all use the same encoder-decoder model with the same training procedure. Tasks are distinguished by text prefixes (e.g., "translate English to French:").
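
A minimal example of the text-to-text interface, assuming the `t5-small` checkpoint (any T5 checkpoint works the same way):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is specified entirely by the text prefix; the model simply generates output text
inputs = tokenizer("translate English to French: The house is wonderful.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```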

Question 16

What is T5's pre-training objective?

  • (a) Masked Language Modeling
  • (b) Next Sentence Prediction
  • (c) Span corruption (masking contiguous spans and predicting them)
  • (d) Causal language modeling
**Answer: (c) Span corruption (masking contiguous spans and predicting them).** T5 replaces contiguous spans of tokens with sentinel tokens and trains the model to generate the missing spans. This is a generalization of BERT's single-token masking that requires understanding broader context and span boundaries.

Question 17

In feature extraction mode, what happens to the pre-trained model's parameters during downstream training?

  • (a) They are randomly re-initialized
  • (b) They are updated with a very small learning rate
  • (c) They are frozen and not updated
  • (d) Only the attention weights are updated
**Answer: (c) They are frozen and not updated.** In feature extraction, the pre-trained model serves as a fixed feature extractor. Its parameters are frozen (`requires_grad = False`), and only the task-specific classifier on top is trained. This is faster and requires less memory than fine-tuning.
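
A sketch of the freezing step with a HuggingFace model (`base_model` points at the pre-trained backbone; the checkpoint name is just an example):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the pre-trained encoder; only the randomly initialized classification head will train
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the classifier head's weights remain trainable
```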

Question 18

Which strategy is recommended for a scenario with a very small labeled dataset (fewer than 500 examples)?

  • (a) Full fine-tuning with a large learning rate
  • (b) Feature extraction or partial fine-tuning
  • (c) Training from scratch with data augmentation
  • (d) Using only the embedding layer of BERT
**Answer: (b) Feature extraction or partial fine-tuning.** With very few labeled examples, full fine-tuning risks overfitting to the small dataset and catastrophic forgetting of the pre-trained knowledge. Feature extraction (frozen model) or partial fine-tuning (freezing the lower layers) preserves more of the pre-trained representations while still adapting to the task.

Question 19

What linguistic information do BERT's lower layers (1-4) primarily encode?

  • (a) Semantic roles and coreference
  • (b) Surface-level features like part-of-speech
  • (c) Document-level topic information
  • (d) Task-specific classification features
**Answer: (b) Surface-level features like part-of-speech.** Research has shown that BERT's layers encode a hierarchy of linguistic information: lower layers capture surface-level features (POS tags, simple syntax), middle layers capture syntactic structure (parse trees, dependencies), and upper layers capture semantic features (coreference, entity types, semantic roles).

Question 20

How does ELECTRA's pre-training objective differ from BERT's MLM?

  • (a) ELECTRA predicts the next sentence instead of masked tokens
  • (b) ELECTRA trains a discriminator to detect which tokens have been replaced by a generator
  • (c) ELECTRA uses a denoising autoencoder objective
  • (d) ELECTRA masks entire sentences instead of individual tokens
**Answer: (b) ELECTRA trains a discriminator to detect which tokens have been replaced by a generator.** ELECTRA uses a small generator to produce plausible token replacements, then trains the main model (the discriminator) to classify each token as original or replaced. This is more sample-efficient than MLM because the model learns from all tokens, not just the 15% that are masked.

Question 21

What is the maximum input sequence length for BERT-Base?

  • (a) 128 tokens
  • (b) 256 tokens
  • (c) 512 tokens
  • (d) 1024 tokens
**Answer: (c) 512 tokens.** BERT's learned position embeddings support a maximum of 512 positions. For longer documents, strategies such as truncation, chunking, or using long-range models (Longformer, BigBird) are needed.
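
In practice the limit is enforced at tokenization time; a short sketch of truncating an over-long document:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_document = " ".join(["word"] * 2_000)   # stand-in for a document longer than 512 tokens

enc = tokenizer(long_document, truncation=True, max_length=512, return_tensors="pt")
print(enc["input_ids"].shape)  # torch.Size([1, 512]); everything past the limit is dropped
```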

Question 22

What is the purpose of the warmup period in the learning rate schedule for fine-tuning?

  • (a) To reduce the batch size gradually
  • (b) To gradually increase the learning rate from zero to prevent large, destabilizing updates early in training
  • (c) To pre-compute the embeddings before fine-tuning begins
  • (d) To evaluate the model on a validation set before training
**Answer: (b) To gradually increase the learning rate from zero to prevent large, destabilizing updates early in training.** Early in fine-tuning, the gradients can be noisy because the classification head is randomly initialized while the base model is pre-trained. The warmup period gradually increases the learning rate, giving the model time to stabilize before applying the full learning rate.
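
A sketch of a typical fine-tuning setup with linear warmup, assuming a 1,000-step run with 100 warmup steps (the checkpoint and step counts are illustrative):

```python
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

optimizer = AdamW(model.parameters(), lr=2e-5)   # within the recommended 2e-5 to 5e-5 range
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,       # learning rate ramps from 0 up to 2e-5 over the first 100 steps
    num_training_steps=1_000,   # then decays linearly back to 0 over the remaining steps
)
# In the training loop, call optimizer.step() followed by scheduler.step() after every batch.
```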

Question 23

Which HuggingFace class would you use to automatically load the correct tokenizer for any model checkpoint?

  • (a) BertTokenizer
  • (b) PreTrainedTokenizer
  • (c) AutoTokenizer
  • (d) TokenizerFactory
**Answer: (c) AutoTokenizer.** The Auto classes (AutoTokenizer, AutoModel, AutoModelForSequenceClassification, etc.) inspect the model configuration and automatically instantiate the correct class. This provides a unified interface across all model architectures.
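
The same call works for any checkpoint; the concrete tokenizer class is resolved from the saved configuration:

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # resolves to a BERT WordPiece tokenizer
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")     # resolves to RoBERTa's byte-level BPE tokenizer
print(type(bert_tok).__name__, type(roberta_tok).__name__)
```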

Question 24

ALBERT replaces BERT's Next Sentence Prediction with Sentence Order Prediction. How does SOP differ from NSP?

  • (a) SOP uses three classes instead of two
  • (b) SOP predicts whether two consecutive sentences are in the correct order (vs. NSP which uses random negatives)
  • (c) SOP operates at the paragraph level instead of the sentence level
  • (d) SOP is only used during fine-tuning, not pre-training
**Answer: (b) SOP predicts whether two consecutive sentences are in the correct order (vs. NSP, which uses random negatives).** NSP's negative examples (random sentences from the corpus) are often trivially distinguishable by topic alone, so the model learns topic matching rather than coherence. SOP uses the same two consecutive sentences but swaps their order, forcing the model to learn discourse coherence.

Question 25

What is mixed precision training, and why is it beneficial for fine-tuning large models?

  • (a) Training on a mix of labeled and unlabeled data to improve generalization
  • (b) Using 16-bit floating point for most operations, reducing memory usage and increasing speed
  • (c) Mixing different model architectures during training for better representations
  • (d) Alternating between fine-tuning and feature extraction during training
**Answer: (b) Using 16-bit floating point for most operations, reducing memory usage and increasing speed.** Mixed precision training (FP16) approximately halves memory usage, allowing larger batch sizes, and can double training speed on GPUs with tensor cores. Critical operations (such as loss computation) still use FP32 to maintain numerical stability.
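
A minimal mixed precision training step with PyTorch's automatic mixed precision, assuming a CUDA GPU and using `distilbert-base-uncased` with a toy two-example batch:

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"  # GradScaler/autocast as used here target CUDA GPUs
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
).to(device)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
optimizer = AdamW(model.parameters(), lr=2e-5)
scaler = torch.cuda.amp.GradScaler()

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt").to(device)
labels = torch.tensor([1, 0], device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():           # forward pass runs in FP16 where numerically safe
    loss = model(**batch, labels=labels).loss
scaler.scale(loss).backward()             # scale the loss so small FP16 gradients do not underflow
scaler.step(optimizer)                    # unscales gradients, then takes the optimizer step
scaler.update()
```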