Chapter 13: Quiz

Test your understanding of transfer learning, foundation models, and modern deep learning workflows. Answers follow each question.


Question 1

What is transfer learning, and why is it the default approach in modern deep learning practice?

Answer **Transfer learning** is the practice of using a model trained on one task (the source) as the starting point for a model on a different task (the target). It is the default approach because: (1) pretrained models encode features learned from massive datasets that are too expensive for most practitioners to replicate, (2) these features — especially early-layer features like edges, textures, and syntactic patterns — are general enough to transfer across tasks, and (3) empirically, fine-tuning a pretrained model almost always outperforms training from scratch when the target dataset is limited. Training from scratch is the exception, justified only when the domain is radically different from any available pretraining data, when labeled data is extremely abundant, or when model size constraints rule out pretrained models.

Question 2

Describe the four main strategies on the transfer learning spectrum (zero-shot, linear probe, fine-tuning, progressive unfreezing) and the conditions under which each is appropriate.

Answer
1. **Zero-shot inference**: Use the pretrained model directly with no task-specific training. Appropriate when no labeled data is available and the task is well-represented in the pretraining distribution (e.g., using CLIP for image classification with natural language class descriptions).
2. **Linear probe (feature extraction)**: Freeze the pretrained backbone entirely and train only a new linear classification head. Appropriate for small labeled datasets (100-1,000 examples) when the target domain is similar to the pretraining domain. Training is fast and deterministic (convex optimization for the linear head).
3. **Fine-tuning**: Unfreeze some or all pretrained layers and train with a small learning rate (typically with a learning rate differential — lower LR for pretrained layers, higher for the new head). Appropriate for moderate labeled datasets (1,000-100,000 examples) and moderate domain distance.
4. **Progressive unfreezing**: Gradually unfreeze layers from the last (most task-specific) to the first (most general), training each newly unfrozen group for a few epochs before unfreezing the next. Appropriate when domain distance is large and catastrophic forgetting is a risk. Introduced by Howard and Ruder (2018) in ULMFiT.
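The freezing mechanics behind a linear probe can be sketched in a few lines of PyTorch. The two-layer `backbone` here is a toy stand-in for any real pretrained encoder:

```python
import torch.nn as nn

# Toy stand-in for a pretrained encoder; in practice this would be a
# loaded checkpoint (ResNet, BERT, etc.).
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
head = nn.Linear(64, 10)  # new task-specific head, randomly initialized

# Linear probe: freeze every backbone parameter, train only the head.
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(backbone, head)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # → ['1.weight', '1.bias']
```

Fine-tuning is the same setup with `requires_grad` left `True` (for all layers, or for the last few in progressive unfreezing), usually combined with a smaller learning rate for the backbone.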

Question 3

What is the "transferability gradient," and how does it influence fine-tuning strategy?

Answer The **transferability gradient** is the observation that early layers of neural networks learn general, transferable features (edge detectors, color blobs for vision; syntactic patterns for language), while later layers learn increasingly task-specific features. This was demonstrated by Zeiler and Fergus (2014) and Yosinski et al. (2014) for CNNs on ImageNet. The gradient influences fine-tuning strategy because: (1) in linear probing, early-layer generality explains why frozen features work at all, (2) in fine-tuning, later layers need more adaptation than early layers, justifying discriminative learning rates (smaller LR for early layers, larger for later layers), and (3) in progressive unfreezing, the gradient dictates the unfreezing order — unfreeze from last to first because late layers need the most adaptation.

Question 4

What is domain shift, and what are the three main types?

Answer
**Domain shift** occurs when the source (pretraining) distribution $p_S(\mathbf{x}, y)$ differs from the target distribution $p_T(\mathbf{x}, y)$. The three main types are:
1. **Covariate shift**: $p_S(\mathbf{x}) \neq p_T(\mathbf{x})$ but $p_S(y \mid \mathbf{x}) = p_T(y \mid \mathbf{x})$. The input distribution changes but the labeling function stays the same. Example: training on professional photographs, deploying on smartphone photos.
2. **Label shift** (prior probability shift): $p_S(y) \neq p_T(y)$ but $p_S(\mathbf{x} \mid y) = p_T(\mathbf{x} \mid y)$. The class frequencies change but the class-conditional distributions stay the same. Example: balanced training set deployed in an environment where one class dominates.
3. **Concept drift**: $p_S(y \mid \mathbf{x}) \neq p_T(y \mid \mathbf{x})$. The relationship between inputs and outputs changes. Example: user preferences change over time, so the same content now has different engagement patterns.

Question 5

What is negative transfer, and how can you detect it?

Answer **Negative transfer** occurs when fine-tuning a pretrained model produces worse performance than training from scratch. This happens when pretrained features are actively misleading for the target domain — they encode patterns that are irrelevant or counterproductive. Detection is straightforward: always compare the fine-tuned model against a trained-from-scratch baseline (even a simple one). If the from-scratch model outperforms the fine-tuned model, negative transfer is occurring. Causes include large domain distance, incompatible inductive biases (e.g., using a model pretrained on natural images for medical images with fundamentally different color distributions), and insufficient fine-tuning data to overcome misleading pretrained features.

Question 6

Explain the contrastive loss (InfoNCE) used in SimCLR and two-tower models. What role does the temperature parameter $\tau$ play?

Answer The **InfoNCE loss** (also called NT-Xent in SimCLR) computes: $$\mathcal{L} = -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j^+) / \tau)}{\sum_{k} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k) / \tau)}$$ For a positive pair $(i, j^+)$, the loss encourages the model to assign high similarity to the positive pair relative to all negative pairs in the batch. The **temperature** $\tau$ controls the sharpness of the softmax distribution: low $\tau$ makes the distribution peaky (the model focuses on the hardest negatives — those with highest similarity), while high $\tau$ smooths the distribution (all negatives contribute equally). Empirically, moderate temperatures (0.05-0.5) work best; too low leads to training instability (focusing on a few very hard negatives), too high leads to insufficiently discriminative embeddings.
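The loss above reduces to a cross-entropy over in-batch similarities, which makes it compact to implement. A minimal sketch with random embeddings (cosine similarity via L2 normalization; row $i$ of the second batch is the positive for row $i$ of the first):

```python
import torch
import torch.nn.functional as F

def info_nce(z_anchor, z_positive, tau=0.1):
    """InfoNCE over a batch: the diagonal holds the positive pairs,
    every off-diagonal entry serves as an in-batch negative."""
    z_anchor = F.normalize(z_anchor, dim=1)
    z_positive = F.normalize(z_positive, dim=1)
    logits = z_anchor @ z_positive.T / tau      # cosine similarities / temperature
    labels = torch.arange(z_anchor.size(0))     # positive for row i is column i
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
za, zp = torch.randn(8, 16), torch.randn(8, 16)
loss_sharp = info_nce(za, zp, tau=0.05)   # peaky softmax: hardest negatives dominate
loss_smooth = info_nce(za, zp, tau=5.0)   # flat softmax: all negatives weighted similarly
```

When the two views are perfectly aligned (`info_nce(za, za)`), the diagonal dominates and the loss collapses toward zero, which is the behavior the objective rewards.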

Question 7

Why do two-tower retrieval models use separate encoders for queries and items instead of a single shared encoder?

Answer Two-tower models use separate encoders for a critical **deployment reason**: item embeddings can be precomputed offline and stored in a vector index (FAISS), so at serving time only the query (user) tower needs to run online. This makes retrieval sublinear in the number of items — a nearest-neighbor search in the precomputed index. A shared encoder (also called a cross-encoder or interaction model) would require running inference on every (query, item) pair at serving time, which is linear in catalog size and infeasible for large catalogs (hundreds of thousands to millions of items). The tradeoff is that two-tower models cannot capture fine-grained query-item interactions (they only compute a dot product of independent embeddings), which is why they are used for candidate retrieval (top-100 to top-1000) followed by a more expensive cross-encoder for ranking (top-10 to top-100).
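The serving-time split described above can be sketched with NumPy: item embeddings are computed once offline, and each online request runs only the query tower plus a similarity lookup (a real system would use an ANN index such as FAISS instead of a full dot product):

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline: the item tower runs once over the whole catalog; the resulting
# embeddings are stored in a vector index. Random vectors stand in here.
item_emb = rng.normal(size=(10_000, 32)).astype(np.float32)
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

# Online: only the query (user) tower runs per request.
query = rng.normal(size=32).astype(np.float32)
query /= np.linalg.norm(query)

scores = item_emb @ query           # similarity against the precomputed index
top100 = np.argsort(-scores)[:100]  # candidate set passed to the ranking stage
```

A cross-encoder cannot be served this way: its score depends jointly on (query, item), so nothing can be precomputed per item.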

Question 8

What is masked language modeling, and why does it produce useful pretrained representations?

Answer **Masked language modeling (MLM)** is a self-supervised pretraining objective where a fraction of input tokens (typically 15%) are replaced with a `[MASK]` token, and the model is trained to predict the original tokens. This produces useful representations because predicting a masked token requires understanding both syntax (grammatical structure) and semantics (meaning and context). For example, predicting the mask in "The cat sat on the [MASK]" requires knowing that the answer should be a noun, likely referring to a surface. At scale (billions of tokens), this objective forces the model to learn deep linguistic knowledge — syntax, coreference, factual associations — that transfers to a wide range of downstream NLP tasks without any task-specific labels during pretraining.
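The masking step can be sketched directly on token-id tensors. This is a simplified version with hypothetical vocabulary and mask ids (the full BERT recipe additionally keeps some selected tokens unchanged or replaces them with random tokens, which is omitted here):

```python
import torch

torch.manual_seed(0)
vocab_size, mask_id = 1000, 999            # hypothetical vocabulary; 999 plays [MASK]
tokens = torch.randint(0, 998, (4, 20))    # batch of four 20-token sequences

# Select ~15% of positions; the model must reconstruct the originals there.
mask = torch.rand(tokens.shape) < 0.15
inputs = tokens.clone()
inputs[mask] = mask_id

labels = tokens.clone()
labels[~mask] = -100   # convention: ignore unmasked positions in the loss
```

Training then minimizes cross-entropy between the model's predictions at masked positions and `labels`, with the `-100` entries excluded from the loss.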

Question 9

What is CLIP, and how does it enable zero-shot image classification?

Answer **CLIP (Contrastive Language-Image Pretraining)** is a model trained on 400 million image-text pairs from the internet. It has two encoders — an image encoder (ViT or ResNet) and a text encoder (Transformer) — trained with a symmetric contrastive loss that maximizes the cosine similarity of matching image-text pairs and minimizes it for non-matching pairs. CLIP enables **zero-shot image classification** by formulating classification as a retrieval problem: given an image, compute its similarity to text embeddings of class descriptions (e.g., "a photo of a dog", "a photo of a cat") and select the highest-similarity class. No task-specific training is needed — the model's understanding of natural language descriptions and visual concepts, learned from web-scale pretraining, is sufficient for many classification tasks.
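The classification-as-retrieval step reduces to a cosine-similarity argmax over one text embedding per class prompt. A sketch with placeholder embeddings (a real CLIP model would supply `image_emb` and `text_embs`; here the "cat" prompt embedding is constructed to lie near the image embedding):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    """CLIP-style zero-shot: normalize, score the image against one text
    embedding per class prompt, return the highest-similarity class."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb
    return class_names[int(np.argmax(sims))]

rng = np.random.default_rng(1)
cat_img = rng.normal(size=64)                         # stands in for an image embedding
prompts = np.stack([cat_img + 0.1 * rng.normal(size=64),  # "a photo of a cat"
                    rng.normal(size=64)])                 # "a photo of a dog"
print(zero_shot_classify(cat_img, prompts, ["cat", "dog"]))  # → cat (by construction)
```

Prompt wording matters in practice: "a photo of a {class}" typically outperforms the bare class name because it better matches the captions seen during pretraining.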

Question 10

What is LoRA, and why has it become the default fine-tuning method for large models?

Answer **LoRA (Low-Rank Adaptation)** adds a low-rank perturbation to a pretrained weight matrix $W \in \mathbb{R}^{d \times k}$: $W' = W + BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$. Only $B$ and $A$ are trained; the original weights $W$ are frozen. LoRA has become the default for large models because: (1) it reduces trainable parameters by 100-1000x (enabling fine-tuning on consumer GPUs), (2) storage is minimal (only $B$ and $A$ per task, not a full model copy), (3) at deployment, the LoRA weights can be merged into the base weights ($W' = W + BA$) with **zero inference overhead** — the merged model has identical architecture and latency to the original, and (4) multiple LoRA adapters can be swapped in and out of a single base model for different tasks, enabling multi-task serving with a single GPU.
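A minimal LoRA layer, including the deployment-time merge, fits in a short PyTorch sketch (toy dimensions; the `alpha/r` scaling follows the original paper's convention, and `B` is zero-initialized so training starts from the pretrained behavior):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of LoRA: y = W x + (alpha/r) * B A x, with W frozen.
    Assumes the wrapped layer has a bias."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: no-op at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

    def merged(self) -> nn.Linear:
        """Fold B A into the base weight: same shapes, zero inference overhead."""
        m = nn.Linear(self.base.in_features, self.base.out_features)
        with torch.no_grad():
            m.weight.copy_(self.base.weight + self.scale * self.B @ self.A)
            m.bias.copy_(self.base.bias)
        return m

torch.manual_seed(0)
layer = LoRALinear(nn.Linear(16, 16))
nn.init.normal_(layer.B)  # pretend training has updated B
x = torch.randn(2, 16)
assert torch.allclose(layer(x), layer.merged()(x), atol=1e-5)
```

Only `A` and `B` (here $2 \times 4 \times 16 = 128$ parameters) are trainable, versus 272 in the wrapped linear layer, and the merged layer is a plain `nn.Linear` with the original shape.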

Question 11

What is the difference between adapters, LoRA, and prompt tuning? When would you choose each?

Answer

| Method | How It Works | Trainable Params | Inference Cost | Best For |
|--------|-------------|-----------------|----------------|----------|
| **Adapters** | Insert small bottleneck modules (down-project, nonlinearity, up-project) after each transformer block | ~2-4% of model | Small overhead (extra forward pass through adapter layers) | Tasks requiring more expressiveness than LoRA |
| **LoRA** | Add low-rank perturbation $BA$ to existing weight matrices | ~0.1-1% of model | None (merge at deployment) | Default choice for most fine-tuning tasks |
| **Prompt tuning** | Prepend learnable "soft prompt" vectors to the input | ~0.01% of model | None | Very large models (>10B params) where even LoRA is expensive |

Choose **LoRA** as the default because of its zero inference overhead. Choose **adapters** when you need more capacity than LoRA provides (complex tasks, large domain gaps). Choose **prompt tuning** for the largest models where storage and memory are critical constraints, or when you want to tune behavior without modifying any model weights.

Question 12

In a two-tower model with in-batch negatives, why does popularity bias arise, and how can it be mitigated?

Answer **Popularity bias** arises because popular items (those with many interactions) appear in more batches, and therefore serve as negatives more frequently. Each time an item appears as a negative, the model receives gradient signal to push it away from the user embedding. Popular items accumulate more "push away" signal than rare items, causing the model to underestimate their relevance. This creates a feedback loop: popular items are retrieved less, leading to fewer future interactions. **Mitigation**: The standard approach is **logQ correction** (Yi et al., 2019), which subtracts $\log p_j$ from the logit for item $j$, where $p_j$ is the item's sampling probability. This corrects for the non-uniform negative sampling distribution. Other approaches include: uniform negative sampling (sample negatives uniformly rather than from the batch), temperature scaling by item frequency, or post-hoc calibration of retrieval scores.
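The logQ correction is a one-line change to the in-batch softmax: subtract each item's log sampling probability from its column of logits. A sketch with random embeddings (the correction follows Yi et al., 2019; embedding sizes and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def sampled_softmax_logq(user_emb, item_emb, log_item_prob, tau=0.05):
    """In-batch sampled softmax with logQ correction: subtracting log p_j
    from item j's logit compensates for popular items being sampled as
    negatives more often than rare items."""
    logits = (user_emb @ item_emb.T) / tau
    logits = logits - log_item_prob          # broadcasts over rows (one p_j per item)
    labels = torch.arange(user_emb.size(0))  # positive item for user i is item i
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
u, v = torch.randn(8, 16), torch.randn(8, 16)
p = torch.full((8,), 1.0 / 8)                # per-item sampling probability
loss = sampled_softmax_logq(u, v, torch.log(p))
```

A useful sanity check: with a uniform sampling distribution the correction subtracts the same constant from every logit in a row, so the loss matches the uncorrected version exactly; the correction only changes behavior when item popularity is skewed.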

Question 13

What is the connection between contrastive learning and mutual information?

Answer Minimizing the InfoNCE contrastive loss maximizes a **lower bound on the mutual information** between the anchor and positive representations: $$I(\mathbf{x}; \mathbf{y}^+) \geq \log(K + 1) - \mathcal{L}_{\text{InfoNCE}}$$ where $K$ is the number of negatives. This bound (from Poole et al., 2019) tells us that contrastive learning implicitly maximizes the shared information between positive pairs while discarding information that does not help distinguish positives from negatives. The bound tightens with more negatives ($K$), which explains why larger batch sizes improve contrastive learning performance. However, the bound saturates at $\log(K+1)$, meaning there are diminishing returns beyond a batch size large enough to make the bound tight.
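The saturation point $\log(K+1)$ is easy to tabulate, which makes the diminishing-returns argument concrete: quadrupling the negative count adds a fixed ~1.39 nats to the ceiling, regardless of scale.

```python
import math

# Ceiling on certifiable mutual information (in nats) for K negatives.
for K in [7, 63, 511, 4095]:
    print(K, round(math.log(K + 1), 2))   # → 2.08, 4.16, 6.24, 8.32
```

So going from 8 to 4096 total samples per batch only raises the bound from ~2.1 to ~8.3 nats, consistent with the observation that batch-size gains flatten out.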

Question 14

Why is temporal splitting (not random splitting) essential when evaluating recommendation models?

Answer **Temporal splitting** divides data by time: train on interactions before time $t$, test on interactions after time $t$. This is essential because random splitting **leaks future information** into the training set. In recommendation, future interactions contain signals about user preference changes, new content trends, and seasonal patterns. A model evaluated with random splits can memorize future patterns that would not be available at serving time, producing optimistic metrics that do not reflect real-world performance. Temporal splitting simulates the production scenario: the model must predict future engagement using only historical data. This is also important for detecting temporal concept drift — a model that performs well on a random split but poorly on a temporal split likely relies on non-stationary patterns.
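The split itself is a single cutoff on the interaction timestamps. A NumPy sketch with synthetic timestamps (an 80/20 cutoff is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
timestamps = rng.integers(0, 1_000_000, size=10_000)  # synthetic interaction times

# Temporal split: train strictly before the cutoff, test at or after it,
# so no future interaction can leak into training.
cutoff = np.quantile(timestamps, 0.8)
train_idx = np.where(timestamps < cutoff)[0]
test_idx = np.where(timestamps >= cutoff)[0]
```

Contrast this with `sklearn`-style random shuffling, which would scatter future interactions into the training set and inflate offline metrics.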

Question 15

What is a foundation model? How does it differ from a standard pretrained model?

Answer A **foundation model** (Bommasani et al., 2021) is a model trained on broad, diverse data at massive scale that serves as the foundation for many downstream tasks. It differs from a standard pretrained model in three ways: (1) **Scale** — foundation models are trained on orders of magnitude more data (billions of tokens/images vs. millions), (2) **Generality** — they transfer to a wide range of tasks without architectural modification, not just tasks similar to the pretraining task, and (3) **Emergence** — they exhibit capabilities that were not explicitly trained for (e.g., CLIP learning object segmentation, LLMs performing arithmetic, GPT-3 doing few-shot learning). A ResNet-50 pretrained on ImageNet is a pretrained model but not typically considered a foundation model, because its representations are specific to image classification. GPT-4, CLIP, and LLaMA are foundation models because they generalize across tasks in ways that their training objectives do not directly predict.

Question 16

Explain discriminative learning rates (learning rate differential) and why they improve fine-tuning.

Answer **Discriminative learning rates** assign different learning rates to different layer groups during fine-tuning, typically with lower rates for earlier (more general) layers and higher rates for later (more task-specific) layers and the new classification head. For example, the backbone might use LR = $10^{-4}$ while the head uses LR = $10^{-3}$. This improves fine-tuning because: (1) early layers contain general features (edges, textures, syntax) that are already useful — updating them too aggressively destroys these features (catastrophic forgetting), (2) later layers contain task-specific features that need more adaptation to the new task, and (3) the new head is randomly initialized and needs the largest learning rate to converge from scratch. The technique was introduced as "discriminative fine-tuning" by Howard and Ruder (2018) in ULMFiT.
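In PyTorch this is expressed through optimizer parameter groups. A sketch using the example rates above, with a toy backbone standing in for a pretrained encoder:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
head = nn.Linear(64, 10)

# One parameter group per zone: the pretrained backbone gets the small LR,
# the randomly initialized head gets 10x more.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-4},
    {"params": head.parameters(),     "lr": 1e-3},
])
```

Finer-grained schedules (e.g., ULMFiT's per-layer decay) just add more groups, one per layer block, with geometrically decreasing rates toward the input.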

Question 17

How does the HuggingFace Trainer API simplify the fine-tuning workflow?

Answer The `Trainer` API handles the complete fine-tuning workflow with minimal boilerplate: (1) **Training loop** — forward pass, loss computation, backward pass, optimizer step, gradient accumulation, (2) **Evaluation** — periodic evaluation on a validation set with custom metrics, (3) **Logging** — integration with TensorBoard, Weights & Biases, and other logging frameworks, (4) **Checkpointing** — automatic model saving with configurable strategy (every epoch, every N steps, or best model only), (5) **Mixed precision** — automatic FP16/BF16 training with `fp16=True`, (6) **Distributed training** — multi-GPU and multi-node training with DeepSpeed or FSDP through configuration flags, and (7) **Early stopping and best model selection** — `load_best_model_at_end=True` with configurable metric. This allows practitioners to focus on data preparation, model selection, and evaluation strategy rather than training infrastructure.

Question 18

A two-tower model achieves HR@100 = 0.25. What does this mean, and is it good?

Answer
**HR@100 = 0.25** means that for 25% of users, the relevant item appears in the top 100 retrieved items. Whether this is "good" depends on context:
- For **candidate retrieval** (the first stage of a multi-stage pipeline), HR@100 = 0.25 is the ceiling for all downstream ranking stages — the ranker can only reorder items that the retriever found. A common target is HR@100 > 0.5 for a retrieval stage.
- The catalog size matters: retrieving the right item in the top 100 out of 200,000 items is much harder than out of 2,000 items.
- Comparison with baselines: if random retrieval gives HR@100 ≈ 0.0005 (100 out of 200,000 items) and a popularity baseline gives HR@100 = 0.10, then 0.25 represents substantial improvement.

In the StreamRec progressive project, Track A targets HR@100 > 0.15 (frozen encoders), Track B targets > 0.25, and Track C targets > 0.30.
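The metric itself is a per-user membership test averaged over users. A toy sketch with `k=3` and made-up item ids for readability (real evaluation uses `k=100` over the model's ranked lists):

```python
import numpy as np

def hit_rate_at_k(retrieved, relevant, k=100):
    """Fraction of users whose held-out relevant item appears in their top-k list."""
    hits = [rel in ret[:k] for ret, rel in zip(retrieved, relevant)]
    return float(np.mean(hits))

# 4 users; each row is that user's ranked retrieval list, paired with the
# single held-out relevant item id.
retrieved = [[5, 2, 9], [1, 1, 1], [7, 3, 4], [0, 8, 6]]
relevant = [2, 4, 3, 9]
print(hit_rate_at_k(retrieved, relevant, k=3))  # → 0.5 (users 0 and 2 are hits)
```

Because it only checks membership in the top-k set, HR@k is position-blind within the list; rank-sensitive metrics like NDCG or MRR complement it in the ranking stage.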

Question 19

What is the role of the projection head in contrastive learning, and why is it discarded after pretraining?

Answer The **projection head** is a small MLP (typically 2 layers) that maps the backbone's representation to the space where the contrastive loss is computed. Chen et al. (2020) showed that using a projection head substantially improves representation quality in SimCLR — but critically, the representations *before* the projection head (the backbone output) transfer better to downstream tasks than the representations *after* it. The projection head is discarded after pretraining because it learns to discard information that is not useful for the contrastive task but *is* useful for downstream tasks. The contrastive loss cares only about distinguishing between augmented views, so the projection head can safely discard features (color, orientation) that are invariant across augmentations but informative for classification. The backbone retains all information, making it a better starting point for transfer.

Question 20

Why can LoRA weights be merged into the base model at deployment, and what is the practical significance of this property?

Answer LoRA modifies a linear layer by adding a low-rank perturbation: during training, the output is $\mathbf{y} = W\mathbf{x} + BA\mathbf{x} = (W + BA)\mathbf{x}$. At deployment, we can compute the merged weight $W' = W + BA$ once and replace the original layer. The merged model is architecturally identical to the original — same number of layers, same tensor shapes, same forward pass — so there is **zero inference overhead**. This is practically significant because: (1) serving costs are unchanged — you do not need a more expensive GPU or longer timeout, (2) multiple LoRA adapters for different tasks can be pre-merged into separate model copies, enabling multi-tenant serving, and (3) the LoRA weights themselves are small (typically < 1% of model size), so storing dozens of task-specific adapters costs far less than storing dozens of full model copies. This combination of training efficiency, storage efficiency, and zero inference overhead is why LoRA dominates parameter-efficient fine-tuning in production.