In This Chapter
- Learning Objectives
- 13.1 The Paradigm Shift: From Training to Adapting
- 13.2 The Transfer Learning Spectrum
- 13.3 Domain Adaptation and Domain Shift
- 13.4 Self-Supervised Learning: How Pretrained Models Are Built
- 13.5 Foundation Models and the New Workflow
- 13.6 Two-Tower Models and Contrastive Learning for Retrieval
- 13.7 The HuggingFace Ecosystem
- 13.8 The Complete Modern DL Workflow
- 13.9 StreamRec Progressive Project — Milestone M5
- 13.10 Climate DL: Fine-Tuning a Vision Transformer for Satellite Imagery
- 13.11 Putting It All Together
Chapter 13: Transfer Learning, Foundation Models, and the Modern Deep Learning Workflow
"Don't be a hero. Use pretrained models." — Andrej Karpathy, "A Recipe for Training Neural Networks" (2019)
Learning Objectives
By the end of this chapter, you will be able to:
- Apply transfer learning strategies (feature extraction, fine-tuning, progressive unfreezing) for different data regimes
- Select and adapt pretrained foundation models for domain-specific tasks
- Design embedding pipelines that leverage pretrained models for downstream retrieval and classification
- Evaluate when to train from scratch vs. fine-tune vs. use a foundation model as-is
- Implement a complete modern DL workflow: pretrained backbone → adapter/fine-tune → evaluation → deployment
13.1 The Paradigm Shift: From Training to Adapting
Every model we have built so far in this book started with random weights. The MLP in Chapter 6 initialized its parameters from a Gaussian distribution. The CNN in Chapter 8 used Kaiming initialization. The transformer in Chapter 10 began as a randomly wired attention machine. In each case, the model learned everything — from low-level features to task-specific decision boundaries — from the labeled data you provided.
This approach has a name: training from scratch. And for most practitioners, in most situations, it is the wrong approach.
Here is the dirty secret of modern deep learning: almost no one trains from scratch anymore. The standard workflow at companies deploying deep learning — from two-person startups to Google — is:
- Find a pretrained model that was trained on a large, general-purpose dataset.
- Adapt it to your specific domain and task.
- Deploy.
This chapter explains why this workflow dominates, formalizes the strategies for adaptation, and teaches you to make the critical decisions: when to use a pretrained model, which pretrained model, how much to adapt, and how to evaluate whether the adaptation worked.
Simplest Model That Works: Transfer learning is the most powerful instantiation of this theme. A pretrained model already encodes millions of dollars worth of compute and vast quantities of data. Using it is not laziness — it is engineering discipline. Training from scratch when a pretrained model exists is like writing your own database engine when PostgreSQL is available: technically impressive, practically unwise, and likely to produce an inferior result.
The Economics of Pretraining
The economic argument for transfer learning is stark. Consider the compute required to train models from scratch:
| Model | Parameters | Training Tokens/Images | Estimated Compute Cost |
|---|---|---|---|
| ResNet-50 | 25M | 1.2M images (ImageNet) | ~$50 (4 GPU-days) |
| BERT-base | 110M | 3.3B tokens | ~$5,000 (16 TPU-days) |
| ViT-Large | 307M | 14M images (ImageNet-21k) | ~$15,000 |
| GPT-3 | 175B | 300B tokens | ~$4,600,000 |
| Llama 3 70B | 70B | 15T tokens | ~$10,000,000+ |
Fine-tuning any of these models on your domain-specific data costs a small fraction of the pretraining budget: typically $10-$1,000 and a few hours on a single GPU. The pretrained model is a compressed representation of a massive dataset, and transfer learning lets you leverage that representation without paying the pretraining cost.
Why Transfer Learning Works: Feature Reuse
Transfer learning works because learned features are often general. The seminal insight came from Zeiler and Fergus (2014) and Yosinski et al. (2014), who visualized the features learned by CNNs trained on ImageNet:
- Layer 1: Gabor-like edge detectors and color blobs — universal across all visual tasks.
- Layer 2: Textures and simple patterns — still highly transferable.
- Layer 3: Object parts and more complex patterns — partially transferable.
- Layer 4+: Task-specific combinations — less transferable.
This hierarchy creates a transferability gradient: early layers learn general features that transfer well, while later layers learn task-specific features that may not. The same principle applies to language models, where early transformer layers learn syntactic patterns that transfer across tasks, while later layers encode more task-specific semantics.
Research Insight: Yosinski et al. (2014) quantified the transferability of features across layers by freezing various combinations of early and late layers. They found a striking result: with limited target data, frozen features transferred from a pretrained network outperformed features learned from scratch. The pretrained network's features were not just good initializations — they represented a better region of the loss landscape that random initialization rarely finds.
13.2 The Transfer Learning Spectrum
Transfer learning is not a single technique but a spectrum of strategies. The right strategy depends on three factors:
- How much labeled data do you have? (100 examples vs. 10,000 vs. 1,000,000)
- How different is your domain from the pretraining domain? (natural images → medical images vs. natural images → satellite imagery)
- How much compute can you afford? (single GPU for an hour vs. 8 GPUs for a week)
We organize the spectrum from least to most adaptation:
graph LR
A["Zero-Shot<br/>(No labeled data)"] --> B["Linear Probe<br/>(Frozen backbone + linear head)"]
B --> C["Fine-Tuning<br/>(Unfreeze some/all layers)"]
C --> D["Progressive Unfreezing<br/>(Gradual layer-by-layer)"]
D --> E["Train from Scratch<br/>(Random initialization)"]
style A fill:#e8f5e9
style B fill:#c8e6c9
style C fill:#a5d6a7
style D fill:#81c784
style E fill:#66bb6a
Strategy 1: Zero-Shot Inference
With zero-shot inference, you use a foundation model directly without any task-specific training. The model has learned such general representations during pretraining that it can handle your task out of the box.
When to use: No labeled data, or the task is well-represented in the pretraining distribution.
from transformers import pipeline

# Zero-shot text classification — no training required
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "StreamRec user complained about too many cooking videos in their feed",
    candidate_labels=["content relevance", "technical issue", "billing", "account"],
)
print(result["labels"][0], f"({result['scores'][0]:.3f})")
# content relevance (0.891)
The model was trained on natural language inference (NLI), not customer support classification. Yet it achieves reasonable accuracy because NLI — determining whether a premise entails a hypothesis — is a general enough capability that it transfers to classification when you frame each label as a hypothesis.
Strategy 2: Feature Extraction (Linear Probe)
Feature extraction freezes the pretrained backbone entirely and trains only a new classification head (typically a single linear layer). The backbone acts as a fixed feature extractor.
When to use: Small labeled dataset (100–1,000 examples), domain similar to pretraining data.
import torch
import torch.nn as nn
from torchvision import models, transforms
from torch.utils.data import DataLoader, TensorDataset
from typing import Tuple
import numpy as np

class LinearProbe(nn.Module):
    """Feature extraction with a frozen pretrained backbone.

    The backbone is frozen (no gradient computation), and only
    the linear classification head is trained. This is equivalent
    to using the backbone as a fixed feature extractor.

    Args:
        backbone: Pretrained model (e.g., resnet50).
        feature_dim: Dimensionality of backbone output features.
        num_classes: Number of target classes.
    """

    def __init__(
        self, backbone: nn.Module, feature_dim: int, num_classes: int
    ) -> None:
        super().__init__()
        self.backbone = backbone
        # Freeze all backbone parameters
        for param in self.backbone.parameters():
            param.requires_grad = False
        self.head = nn.Linear(feature_dim, num_classes)

    def train(self, mode: bool = True) -> "LinearProbe":
        # Keep the backbone in eval mode even while the head trains, so
        # BatchNorm running statistics stay frozen along with the weights.
        super().train(mode)
        self.backbone.eval()
        return self

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            features = self.backbone(x)
        return self.head(features)

def build_linear_probe(num_classes: int) -> LinearProbe:
    """Build a linear probe on top of a pretrained ResNet-50.

    Returns a model where only the final linear layer is trainable.
    The pretrained ResNet-50 backbone is used as a frozen feature extractor.

    Args:
        num_classes: Number of output classes.

    Returns:
        LinearProbe model ready for training.
    """
    # Load pretrained ResNet-50, remove its classification head
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    feature_dim = resnet.fc.in_features  # 2048
    resnet.fc = nn.Identity()  # Replace FC with identity
    return LinearProbe(resnet, feature_dim, num_classes)

# Usage
model = build_linear_probe(num_classes=10)

# Count trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Trainable fraction: {trainable_params / total_params:.4%}")
# Total parameters: 23,528,522
# Trainable parameters: 20,490
# Trainable fraction: 0.0871%
The linear probe has a remarkable property: because the backbone is frozen, the training objective is convex in the head parameters. The optimal head can therefore be found with logistic regression (or ordinary least squares for a regression head), and training is deterministic: no learning rate tuning, no local minima.
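A toy sketch of this convexity property: with the backbone frozen, the head can be fit directly with scikit-learn's off-the-shelf convex solver. The random features below stand in for frozen backbone outputs, and `fit_probe_head` is an illustrative helper, not part of the chapter's API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_probe_head(features: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit the linear-probe head as a convex logistic regression problem.

    With the backbone frozen, the features are constants and the head is
    the entire model: a convex objective with no learning rate to tune.
    """
    clf = LogisticRegression(max_iter=1000, C=1.0)
    clf.fit(features, labels)
    return clf

# Synthetic stand-in for frozen backbone features
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 32))
labels = (features[:, 0] > 0).astype(int)  # linearly separable toy task

probe = fit_probe_head(features, labels)
print(f"train accuracy: {probe.score(features, labels):.3f}")
```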
Common Misconception: "If a linear probe works well, the pretrained features are good. If it doesn't, the pretrained features are bad." This is only half right. A linear probe tests whether the pretrained features are linearly separable for your task. The features might contain all the information needed, but in a form that requires nonlinear combination. In such cases, fine-tuning or a small nonlinear head will substantially outperform the linear probe, even though the underlying features are the same.
Strategy 3: Fine-Tuning
Fine-tuning unfreezes some or all of the pretrained backbone and trains it jointly with the new head, using a small learning rate to avoid destroying the pretrained representations.
When to use: Moderate labeled dataset (1,000–100,000 examples), domain partially different from pretraining data.
The key insight is learning rate differential: the pretrained layers should update more slowly than the new head, because the pretrained representations are already good and we want to refine them, not overwrite them.
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

class FineTunedModel(nn.Module):
    """Fine-tuned pretrained model with differential learning rates.

    All backbone parameters are unfrozen and trained with a smaller
    learning rate than the new classification head.

    Args:
        backbone: Pretrained model.
        feature_dim: Dimensionality of backbone output features.
        num_classes: Number of target classes.
        dropout: Dropout probability before the classification head.
    """

    def __init__(
        self,
        backbone: nn.Module,
        feature_dim: int,
        num_classes: int,
        dropout: float = 0.2,
    ) -> None:
        super().__init__()
        self.backbone = backbone
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)
        features = self.dropout(features)
        return self.head(features)

def build_fine_tuned_model(num_classes: int) -> Tuple[FineTunedModel, AdamW]:
    """Build a fine-tuned ResNet-50 with differential learning rates.

    The backbone is trained with a 10x smaller learning rate than
    the classification head, preserving pretrained representations
    while allowing task-specific adaptation.

    Args:
        num_classes: Number of output classes.

    Returns:
        Tuple of (model, optimizer).
    """
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    feature_dim = resnet.fc.in_features
    resnet.fc = nn.Identity()
    model = FineTunedModel(resnet, feature_dim, num_classes)

    # Differential learning rates: backbone lr = head lr / 10
    head_lr = 1e-3
    backbone_lr = head_lr / 10
    optimizer = AdamW(
        [
            {"params": model.backbone.parameters(), "lr": backbone_lr},
            {"params": model.head.parameters(), "lr": head_lr},
        ],
        weight_decay=0.01,
    )
    return model, optimizer
Strategy 4: Progressive Unfreezing
Progressive unfreezing, introduced by Howard and Ruder (2018) in their ULMFiT paper, is a disciplined approach to fine-tuning that unfreezes layers one group at a time, from the last (most task-specific) to the first (most general).
When to use: Moderate to large labeled dataset, risk of catastrophic forgetting (domain very different from pretraining data).
The procedure:
- Freeze all pretrained layers. Train only the new head for 1-2 epochs.
- Unfreeze the last pretrained layer group. Train for 1-2 epochs with a lower learning rate.
- Unfreeze the next layer group. Repeat.
- Continue until all layers are unfrozen.
At each stage, the learning rate for newly unfrozen layers is smaller than for previously unfrozen layers, creating a learning rate schedule that mirrors the transferability gradient.
from typing import List, Dict, Any

def progressive_unfreeze(
    model: nn.Module,
    layer_groups: List[List[nn.Parameter]],
    base_lr: float = 1e-3,
    lr_decay_factor: float = 0.3,
) -> List[Dict[str, Any]]:
    """Create parameter groups for progressive unfreezing.

    Each layer group gets a learning rate that is lr_decay_factor
    times smaller than the next group. The last group (classification
    head) gets base_lr, the second-to-last gets base_lr * lr_decay_factor,
    and so on.

    This implements "discriminative learning rates" from ULMFiT
    (Howard and Ruder, 2018).

    Args:
        model: The model to configure.
        layer_groups: List of parameter groups, from earliest to latest.
        base_lr: Learning rate for the last (newest) group.
        lr_decay_factor: Multiplicative decay per group.

    Returns:
        Parameter groups suitable for an optimizer.
    """
    n_groups = len(layer_groups)
    param_groups = []
    for i, params in enumerate(layer_groups):
        # Earlier groups get smaller learning rates
        group_lr = base_lr * (lr_decay_factor ** (n_groups - 1 - i))
        param_groups.append({"params": params, "lr": group_lr})
    return param_groups

def create_resnet_layer_groups(model: FineTunedModel) -> List[List[nn.Parameter]]:
    """Split a ResNet-50 into layer groups for progressive unfreezing.

    Groups (from earliest to latest):
        0: conv1, bn1 (stem)
        1: layer1 (first residual block)
        2: layer2
        3: layer3
        4: layer4
        5: head (classification layer)

    Args:
        model: FineTunedModel wrapping a ResNet-50 backbone.

    Returns:
        List of parameter lists, one per layer group.
    """
    backbone = model.backbone
    groups = [
        list(backbone.conv1.parameters()) + list(backbone.bn1.parameters()),
        list(backbone.layer1.parameters()),
        list(backbone.layer2.parameters()),
        list(backbone.layer3.parameters()),
        list(backbone.layer4.parameters()),
        list(model.head.parameters()),
    ]
    return groups
# Example: progressive unfreezing with discriminative LR
# model, _ = build_fine_tuned_model(num_classes=10)
# layer_groups = create_resnet_layer_groups(model)
# param_groups = progressive_unfreeze(model, layer_groups, base_lr=1e-3)
# optimizer = AdamW(param_groups, weight_decay=0.01)
# Stage 1: Freeze groups 0-4, train only group 5 (head)
# Stage 2: Unfreeze group 4 (layer4), train groups 4-5
# Stage 3: Unfreeze group 3 (layer3), train groups 3-5
# ... continue until all groups are unfrozen
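The staged freeze/unfreeze schedule sketched in the comments above can be driven by a small helper. This is an illustrative sketch (`set_stage` is not an API from this chapter); the toy linear layers stand in for the ResNet layer groups.

```python
import torch.nn as nn
from typing import List

def set_stage(layer_groups: List[List[nn.Parameter]], n_unfrozen: int) -> None:
    """Configure one stage of progressive unfreezing.

    Unfreezes the last n_unfrozen groups (the head first, then
    progressively earlier layers) and freezes everything else.
    """
    n_groups = len(layer_groups)
    for i, params in enumerate(layer_groups):
        unfrozen = i >= n_groups - n_unfrozen
        for p in params:
            p.requires_grad = unfrozen

# Toy stand-in for the layer groups of a real backbone
layers = [nn.Linear(4, 4) for _ in range(3)]
groups = [list(layer.parameters()) for layer in layers]

set_stage(groups, n_unfrozen=1)  # stage 1: only the last group trains
print([p.requires_grad for p in layers[0].parameters()])  # [False, False]
print([p.requires_grad for p in layers[2].parameters()])  # [True, True]
```

Each stage then rebuilds the optimizer (or its parameter groups) over the currently trainable parameters before training resumes.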
Strategy 5: Training from Scratch
When should you train from scratch? Almost never — but there are genuine cases:
- Your domain is radically different from any pretraining data. Scientific imaging modalities (electron microscopy, radio astronomy) may have no useful pretrained model.
- You have enormous amounts of labeled data. With millions of labeled examples in your domain, a custom architecture and training procedure may outperform a pretrained model that carries irrelevant inductive biases.
- Latency or model size constraints. If you need a model under 1MB for edge deployment, no pretrained foundation model will fit. A custom-designed small architecture trained from scratch may be the only option.
- Regulatory or IP constraints. Some organizations cannot use models trained on data whose provenance is unclear.
Production Reality: In industry, the decision is rarely pure. Many teams start with a pretrained model, fine-tune it to establish a strong baseline, and then — only after validating that the task justifies the investment — explore training a custom model from scratch. The pretrained baseline gives you a performance target and a deployment timeline. Starting from scratch without that baseline is premature optimization.
The Decision Framework
The following decision tree captures the practical heuristic that most senior practitioners use:
graph TD
A["How much labeled data?"] -->|"< 100 examples"| B["Zero-shot or few-shot<br/>with foundation model"]
A -->|"100 - 1,000"| C["Linear probe<br/>(frozen backbone)"]
A -->|"1,000 - 100,000"| D{"Domain distance?"}
A -->|"> 100,000"| E{"Compute budget?"}
D -->|"Small"| F["Full fine-tuning<br/>(differential LR)"]
D -->|"Large"| G["Progressive unfreezing"]
E -->|"Limited"| F
E -->|"Substantial"| H["Consider training<br/>from scratch"]
style B fill:#e3f2fd
style C fill:#e3f2fd
style F fill:#e3f2fd
style G fill:#e3f2fd
style H fill:#fff3e0
This is a heuristic, not a law. But it captures the overwhelming empirical evidence: for most tasks, with most data budgets, pretrained models outperform training from scratch. The burden of proof is on training from scratch, not on transfer learning.
13.3 Domain Adaptation and Domain Shift
Transfer learning assumes that features learned on a source domain are useful in a target domain. When this assumption holds, transfer works beautifully. When it breaks — when the source and target distributions differ substantially — we face domain shift, and transfer can actually hurt performance.
Formalizing Domain Shift
Let $p_S(\mathbf{x}, y)$ be the source (pretraining) distribution and $p_T(\mathbf{x}, y)$ the target distribution. Domain shift occurs when $p_S \neq p_T$. The shift can decompose into several types:
| Type | Formal Condition | Example |
|---|---|---|
| Covariate shift | $p_S(\mathbf{x}) \neq p_T(\mathbf{x})$, $p_S(y \mid \mathbf{x}) = p_T(y \mid \mathbf{x})$ | ImageNet photos → satellite images (different input distribution, same labeling function) |
| Label shift | $p_S(y) \neq p_T(y)$, $p_S(\mathbf{x} \mid y) = p_T(\mathbf{x} \mid y)$ | Balanced training set → imbalanced deployment (different class frequencies) |
| Concept drift | $p_S(y \mid \mathbf{x}) \neq p_T(y \mid \mathbf{x})$ | User preferences change over time (same content, different engagement patterns) |
Measuring Domain Distance
How similar are two domains? Several metrics quantify domain distance, and the choice matters for predicting whether transfer will succeed:
Maximum Mean Discrepancy (MMD). For feature representations $\phi(\mathbf{x})$ mapped into a reproducing kernel Hilbert space $\mathcal{H}$:
$$\text{MMD}(\mathcal{D}_S, \mathcal{D}_T) = \left\| \frac{1}{|\mathcal{D}_S|} \sum_{\mathbf{x} \in \mathcal{D}_S} \phi(\mathbf{x}) - \frac{1}{|\mathcal{D}_T|} \sum_{\mathbf{x} \in \mathcal{D}_T} \phi(\mathbf{x}) \right\|_{\mathcal{H}}$$
MMD equals zero when the source and target feature distributions are identical. In practice, we compute MMD using a Gaussian kernel over the penultimate layer features of the pretrained model.
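A toy estimator of the squared MMD, written with the kernel trick so the RKHS mean embeddings are never formed explicitly. This is a biased, illustrative sketch; `gaussian_mmd` and the fixed bandwidth are assumptions, not the chapter's code.

```python
import numpy as np

def gaussian_mmd(source: np.ndarray, target: np.ndarray, sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD with a Gaussian kernel.

    Uses MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)], which follows
    from expanding the squared RKHS norm of the mean-embedding difference.
    """
    def kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma**2))

    return float(
        kernel(source, source).mean()
        + kernel(target, target).mean()
        - 2.0 * kernel(source, target).mean()
    )

rng = np.random.default_rng(0)
same = gaussian_mmd(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))
shifted = gaussian_mmd(rng.normal(size=(100, 2)), rng.normal(loc=2.0, size=(100, 2)))
print(f"same distribution: {same:.4f}, shifted distribution: {shifted:.4f}")
```

Matched distributions give an estimate near zero, while the mean-shifted pair gives a clearly larger value.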
Proxy A-Distance. Train a linear classifier to distinguish source from target examples. If the classifier achieves accuracy $A$, the proxy A-distance is $d_A = 2(1 - 2\epsilon)$ where $\epsilon = 1 - A$. If the classifier cannot distinguish the domains ($A \approx 0.5$), the domains are similar and transfer should work. If $A \approx 1.0$, the domains are very different.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def proxy_a_distance(
    source_features: np.ndarray, target_features: np.ndarray
) -> float:
    """Compute the Proxy A-Distance between source and target domains.

    Trains a logistic regression classifier to distinguish source from
    target features. High accuracy means the domains are very different
    (transfer may fail); accuracy near 0.5 means domains are similar
    (transfer should work).

    Reference: Ben-David et al., "A theory of learning from different
    domains" (Machine Learning, 2010).

    Args:
        source_features: Feature matrix from source domain (n_source, d).
        target_features: Feature matrix from target domain (n_target, d).

    Returns:
        Proxy A-distance (0 = identical domains, 2 = maximally different).
    """
    X = np.vstack([source_features, target_features])
    y = np.concatenate([
        np.zeros(len(source_features)),
        np.ones(len(target_features)),
    ])
    clf = LogisticRegression(max_iter=1000, C=1.0)
    accuracy = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    error = 1.0 - accuracy
    return 2.0 * (1.0 - 2.0 * error)
Mathematical Foundation: Ben-David et al. (2010) proved a generalization bound for domain adaptation that explicitly includes the domain distance:
$$\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H} \Delta \mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda^*$$
where $\epsilon_T(h)$ is the target error, $\epsilon_S(h)$ is the source error, $d_{\mathcal{H} \Delta \mathcal{H}}$ is the $\mathcal{H}$-divergence (related to proxy A-distance), and $\lambda^*$ is the error of the ideal joint hypothesis. This bound tells us that transfer learning can only work if (1) we have low error on the source, (2) the domains are close, and (3) there exists some hypothesis that works well on both. If any of these conditions fail, transfer is doomed.
Negative Transfer
When domain distance is large, fine-tuning a pretrained model can underperform training from scratch. This is called negative transfer, and it occurs when:
- The pretrained features encode source-specific patterns that are actively misleading in the target domain.
- The fine-tuning process cannot unlearn these patterns with the available target data.
- The inductive biases from pretraining (e.g., translation invariance from CNNs trained on natural images) conflict with target domain structure.
The defense against negative transfer is monitoring: always compare your transferred model against a trained-from-scratch baseline (even a simple one) to verify that transfer is helping.
13.4 Self-Supervised Learning: How Pretrained Models Are Built
To use pretrained models effectively, you need to understand how they were trained. Most modern pretrained models use self-supervised learning — a training paradigm where the supervision signal comes from the data itself, not from human labels.
Masked Language Modeling (BERT)
BERT (Devlin et al., 2019) randomly masks 15% of input tokens and trains the model to predict them:
$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid \mathbf{x}_{\setminus \mathcal{M}})$$
where $\mathcal{M}$ is the set of masked positions and $\mathbf{x}_{\setminus \mathcal{M}}$ is the input with masked tokens replaced. The model must learn syntax, semantics, and world knowledge to fill in the blanks.
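A minimal PyTorch sketch of this objective: cross-entropy evaluated only at the masked positions. Random logits stand in for a real model's predictions, and `mlm_loss` is an illustrative helper.

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits: torch.Tensor, targets: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked-LM loss: cross-entropy evaluated only at masked positions.

    Args:
        logits: Predictions, shape (batch, seq_len, vocab_size).
        targets: Original token ids, shape (batch, seq_len).
        mask: Boolean tensor, True where a token was masked out.
    """
    # Unmasked positions contribute nothing to the loss.
    return F.cross_entropy(logits[mask], targets[mask])

torch.manual_seed(0)
logits = torch.randn(2, 8, 100)         # batch=2, seq=8, vocab=100
targets = torch.randint(0, 100, (2, 8))
mask = torch.zeros(2, 8, dtype=torch.bool)
mask[:, [1, 5]] = True                  # pretend these positions were masked

loss = mlm_loss(logits, targets, mask)
print(f"MLM loss at masked positions: {loss.item():.3f}")
```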
Masked Image Modeling (MAE)
Masked autoencoders (He et al., 2022) apply the same idea to images: mask 75% of image patches and train a vision transformer to reconstruct the missing patches. The high masking ratio (compared to 15% for text) works because images have higher spatial redundancy — neighboring patches are more predictable than neighboring words.
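The masking step itself is just a uniformly random choice of patch indices; a toy sketch (the helper name is an assumption, not MAE's reference code):

```python
import torch

def random_patch_mask(n_patches: int, mask_ratio: float = 0.75) -> torch.Tensor:
    """Sample an MAE-style boolean patch mask (True = hidden from the encoder)."""
    n_masked = int(n_patches * mask_ratio)
    perm = torch.randperm(n_patches)            # random ordering of patch indices
    mask = torch.zeros(n_patches, dtype=torch.bool)
    mask[perm[:n_masked]] = True                # hide the first 75% of the ordering
    return mask

mask = random_patch_mask(n_patches=196)         # 14x14 patches for a 224px ViT
print(f"masked patches: {int(mask.sum())} / {mask.numel()}")  # masked patches: 147 / 196
```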
Contrastive Learning (SimCLR, DINO)
Contrastive learning learns representations by pulling together different views of the same data point and pushing apart views of different data points.
SimCLR (Chen et al., 2020) creates two augmented views of each image and trains the model so that representations of the same image are similar while representations of different images are dissimilar. The loss function is the normalized temperature-scaled cross-entropy (NT-Xent):
$$\mathcal{L}_{\text{SimCLR}} = -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k) / \tau)}$$
where $\mathbf{z}_i$ and $\mathbf{z}_j$ are the representations of two augmented views of the same image, $\text{sim}(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top \mathbf{v} / (\|\mathbf{u}\| \|\mathbf{v}\|)$ is cosine similarity, $\tau$ is a temperature parameter, and the sum in the denominator is over all $2N$ views in the batch (both augmented versions of all $N$ images), excluding the anchor $i$ itself.
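A compact PyTorch implementation of NT-Xent under the definitions above (an illustrative sketch, not SimCLR's reference code):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent loss for N paired augmented views.

    Args:
        z1, z2: Representations of the two views, each of shape (N, d).
        tau: Temperature.
    """
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d); z @ z.T is cosine sim
    sim = z @ z.T / tau
    sim.fill_diagonal_(float("-inf"))                   # exclude the anchor k = i
    # The positive for row i is its other augmented view: i+N for i < N, else i-N.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

torch.manual_seed(0)
loss = nt_xent_loss(torch.randn(8, 16), torch.randn(8, 16))
print(f"NT-Xent on random features: {loss.item():.3f}")
```

Setting the diagonal to negative infinity implements the indicator $\mathbb{1}_{[k \neq i]}$, since those entries contribute nothing to the softmax denominator.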
DINO (Caron et al., 2021) uses a self-distillation framework: a student network learns to match the output of a teacher network (an exponential moving average of the student), where teacher and student see different augmented views. DINO is notable because its attention maps spontaneously learn to segment objects — an emergent property that was not explicitly trained for.
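The teacher update at the heart of DINO's self-distillation is just an exponential moving average of the student's weights; a minimal sketch (the helper is illustrative, and momentum 0.996 is one commonly used value):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996) -> None:
    """Update the teacher as an exponential moving average of the student.

    The teacher receives no gradients; each weight follows
    t <- momentum * t + (1 - momentum) * s.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = nn.Linear(4, 4)
teacher = nn.Linear(4, 4)
teacher.load_state_dict(student.state_dict())  # teacher starts as a copy
# ... a student optimizer step would happen here ...
ema_update(teacher, student)
```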
Contrastive Language-Image Pretraining (CLIP)
CLIP (Radford et al., 2021) extends contrastive learning to paired image-text data. Given a batch of $N$ (image, text) pairs, CLIP learns image and text encoders that maximize the cosine similarity of matching pairs while minimizing it for non-matching pairs:
$$\mathcal{L}_{\text{CLIP}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_j) / \tau)} + \log \frac{\exp(\text{sim}(\mathbf{t}_i, \mathbf{v}_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{t}_i, \mathbf{v}_j) / \tau)} \right]$$
where $\mathbf{v}_i$ is the image embedding and $\mathbf{t}_i$ is the text embedding for the $i$-th pair. The first term pulls image $i$ toward its correct text and away from all other texts; the second does the same from the text side.
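A toy implementation of this symmetric loss (illustrative only; the real CLIP additionally learns the temperature $\tau$ as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over N matched (image, text) pairs.

    Row i of the logit matrix scores image i against every text; matching
    pairs sit on the diagonal, so the targets are simply 0..N-1. The loss
    averages the image-to-text and text-to-image cross-entropies.
    """
    v = F.normalize(image_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = v @ t.T / tau                         # (N, N) cosine similarities / tau
    targets = torch.arange(v.shape[0])
    loss_i2t = F.cross_entropy(logits, targets)    # images pick their text
    loss_t2i = F.cross_entropy(logits.T, targets)  # texts pick their image
    return 0.5 * (loss_i2t + loss_t2i)

torch.manual_seed(0)
loss = clip_style_loss(torch.randn(8, 32), torch.randn(8, 32))
print(f"symmetric contrastive loss on random embeddings: {loss.item():.3f}")
```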
CLIP's power comes from its scale (400M image-text pairs from the internet) and its generality: the learned embeddings support zero-shot classification by computing similarity between an image embedding and text embeddings of class descriptions ("a photo of a dog", "a photo of a cat").
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

def clip_zero_shot_classify(
    image_path: str, candidate_labels: list[str]
) -> dict[str, float]:
    """Zero-shot image classification using CLIP.

    Computes cosine similarity between the image embedding and
    text embeddings of each candidate label, returning normalized
    probabilities.

    Args:
        image_path: Path to the image file.
        candidate_labels: List of class descriptions.

    Returns:
        Dictionary mapping labels to probabilities.
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open(image_path)
    prompts = [f"a photo of {label}" for label in candidate_labels]
    inputs = processor(
        text=prompts, images=image, return_tensors="pt", padding=True
    )
    with torch.no_grad():  # inference only — no autograd graph needed
        outputs = model(**inputs)

    # Cosine similarity → softmax → probabilities
    logits_per_image = outputs.logits_per_image  # (1, n_labels)
    probs = logits_per_image.softmax(dim=1).squeeze()
    return {
        label: prob.item()
        for label, prob in zip(candidate_labels, probs)
    }
Research Insight: CLIP's training objective is symmetric — it can be read as "for each image, find its matching text" or "for each text, find its matching image." This symmetry makes it a natural fit for retrieval systems, where you need to score query-document pairs in both directions. The two-tower architecture we build for StreamRec (Section 13.6) uses the same principle: one tower for queries (users), one tower for documents (items), trained with contrastive loss.
13.5 Foundation Models and the New Workflow
The term foundation model (Bommasani et al., 2021) refers to a model trained on broad data that can be adapted to a wide range of downstream tasks. The defining characteristics are:
- Scale: Trained on massive datasets (billions of tokens or images).
- Generality: Representations transfer across many tasks without architectural modification.
- Emergence: Capabilities that were not explicitly trained for appear at sufficient scale (e.g., CLIP learning object segmentation, LLMs learning arithmetic).
Sentence Embeddings and Embedding Pipelines
For many production systems, the primary use of pretrained models is not classification or generation but embedding — mapping inputs to dense vector representations that can be used for retrieval, clustering, and similarity search.
The standard embedding pipeline:
graph LR
A["Raw Input<br/>(text, image, audio)"] --> B["Pretrained Encoder<br/>(BERT, ViT, CLIP)"]
B --> C["Pooling<br/>(CLS token, mean)"]
C --> D["Embedding Vector<br/>(d = 384-1024)"]
D --> E["Vector Database<br/>(FAISS, Pinecone)"]
E --> F["Downstream Task<br/>(retrieval, clustering)"]
For text, the dominant approach uses sentence transformers (Reimers and Gurevych, 2019): BERT-like models fine-tuned with a contrastive objective so that similar sentences have similar embeddings.
from sentence_transformers import SentenceTransformer
import numpy as np

def build_item_embeddings(
    descriptions: list[str],
    model_name: str = "all-MiniLM-L6-v2",
    batch_size: int = 64,
) -> np.ndarray:
    """Encode item descriptions into dense embedding vectors.

    Uses a pretrained sentence transformer to generate embeddings
    that capture semantic similarity. Items with similar descriptions
    will have high cosine similarity in the embedding space.

    Args:
        descriptions: List of item description strings.
        model_name: Name of the sentence transformer model.
        batch_size: Encoding batch size.

    Returns:
        Embedding matrix of shape (n_items, embedding_dim).
    """
    model = SentenceTransformer(model_name)
    embeddings = model.encode(
        descriptions,
        batch_size=batch_size,
        show_progress_bar=True,
        normalize_embeddings=True,  # Unit norm for cosine similarity
    )
    return embeddings
# StreamRec: embed item catalog
# item_descriptions = [
# "Documentary about the history of jazz music in New Orleans",
# "Tutorial on advanced Python decorators and metaclasses",
# "Cooking show featuring traditional Japanese ramen recipes",
# ]
# embeddings = build_item_embeddings(item_descriptions)
# print(embeddings.shape) # (3, 384)
Adapters and Parameter-Efficient Fine-Tuning
Fine-tuning an entire foundation model is expensive in memory and storage. If you have 10 downstream tasks, do you need 10 copies of GPT-3? Adapters (Houlsby et al., 2019) solve this by inserting small trainable modules into the frozen pretrained model:
$$\mathbf{h} \leftarrow \mathbf{h} + f(\mathbf{h} \, W_{\text{down}}) \, W_{\text{up}}$$
where $W_{\text{down}} \in \mathbb{R}^{d \times r}$ projects the hidden state to a low-rank bottleneck of dimension $r \ll d$, $f$ is a nonlinearity, and $W_{\text{up}} \in \mathbb{R}^{r \times d}$ projects back. Only $W_{\text{down}}$ and $W_{\text{up}}$ are trained; the original model weights are frozen.
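Under these definitions, an adapter is only a few lines of PyTorch. The sketch below zero-initializes $W_{\text{up}}$ so the adapter starts as the identity, a common stabilization choice; the class itself is illustrative, not the chapter's API.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: h <- h + f(h W_down) W_up.

    Inserted into a frozen pretrained model; only the two small
    projections are trained. Zero-initializing the up-projection makes
    the adapter an identity function at the start of training.
    """
    def __init__(self, d_model: int, bottleneck: int = 64) -> None:
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # W_down: d -> r
        self.up = nn.Linear(bottleneck, d_model)    # W_up: r -> d
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

adapter = Adapter(d_model=768, bottleneck=64)
h = torch.randn(2, 10, 768)
out = adapter(h)
print(torch.allclose(out, h))  # True: the adapter starts as the identity
```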
LoRA (Hu et al., 2022) takes a different approach: instead of adding new modules, it adds a low-rank perturbation to existing weight matrices:
$$W' = W + \Delta W = W + BA$$
where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$. The rank $r$ controls the tradeoff between expressiveness and parameter efficiency.
import torch.nn as nn
import torch
import math
class LoRALinear(nn.Module):
"""Linear layer with LoRA (Low-Rank Adaptation).
Adds a low-rank perturbation BA to the frozen pretrained weight W,
so the effective weight is W + BA. Only B and A are trained.
Reference: Hu et al., "LoRA: Low-Rank Adaptation of Large Language
Models" (ICLR, 2022).
Args:
pretrained_linear: Original frozen linear layer.
rank: Rank of the LoRA decomposition.
alpha: Scaling factor (effective lr multiplier = alpha / rank).
"""
def __init__(
self, pretrained_linear: nn.Linear, rank: int = 8, alpha: float = 16.0
) -> None:
super().__init__()
self.pretrained = pretrained_linear
self.pretrained.weight.requires_grad = False
if self.pretrained.bias is not None:
self.pretrained.bias.requires_grad = False
d_out, d_in = pretrained_linear.weight.shape
self.lora_A = nn.Parameter(torch.randn(rank, d_in) / math.sqrt(d_in))
self.lora_B = nn.Parameter(torch.zeros(d_out, rank))
self.scaling = alpha / rank
def forward(self, x: torch.Tensor) -> torch.Tensor:
# Original output + low-rank perturbation
base_output = self.pretrained(x)
lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
return base_output + lora_output
def apply_lora_to_model(
model: nn.Module,
target_modules: list[str],
rank: int = 8,
alpha: float = 16.0,
) -> nn.Module:
"""Apply LoRA to specified linear layers in a model.
Replaces each target linear layer with a LoRALinear wrapper,
freezing the original weights and adding trainable low-rank
perturbation matrices.
Args:
model: The pretrained model.
target_modules: Names of modules to apply LoRA to.
rank: LoRA rank.
alpha: LoRA scaling factor.
Returns:
Modified model with LoRA layers.
"""
    # Collect targets first: mutating the module tree while iterating
    # over named_modules() is fragile, so replace in a second pass.
    targets = [
        (name, module)
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
        and any(target in name for target in target_modules)
    ]
    for name, module in targets:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, LoRALinear(module, rank, alpha))
    return model
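The "~0.1-1%" figure in the comparison table below can be derived per weight matrix: full fine-tuning trains $d_{\text{out}} \times d_{\text{in}}$ values, while LoRA trains $r(d_{\text{in}} + d_{\text{out}})$. A quick sketch of the arithmetic (the whole-model fraction is lower than the per-matrix fraction because embeddings and untouched layers add to the denominator):

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of a single weight matrix's parameters that LoRA trains."""
    full = d_in * d_out                # full fine-tuning
    lora = rank * (d_in + d_out)       # B (d_out x r) plus A (r x d_in)
    return lora / full

# BERT-base attention projection: 768 x 768 at rank 8
frac = lora_param_fraction(768, 768, rank=8)
print(f"{frac:.2%}")  # 2.08%
```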
Prompt tuning (Lester et al., 2021) takes parameter efficiency further: instead of modifying the model at all, it prepends a set of trainable "soft prompt" vectors to the input. The model is entirely frozen; only the prompt vectors (typically 10-100 tokens) are optimized. This is the most parameter-efficient adaptation method, but it works best for large language models where the prompt can steer behavior through the model's existing capabilities.
| Method | Trainable Parameters | Storage per Task | Inference Overhead |
|---|---|---|---|
| Full fine-tuning | 100% | Full model copy | None |
| Adapter | ~2-4% | Adapter weights only | Small forward pass overhead |
| LoRA | ~0.1-1% | LoRA matrices only | None (merge at deployment) |
| Prompt tuning | ~0.01% | Prompt vectors only | None |
Production Reality: LoRA has become the default fine-tuning method in industry for models above ~1B parameters. The reason is practical: LoRA weights can be merged into the base model weights at deployment time ($W' = W + BA$), so there is zero inference overhead. This means you can serve one base model and swap task-specific LoRA adapters in and out with negligible latency cost. Multiple LoRA adapters can even be served simultaneously by batching requests for different tasks through the same base model.
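The merge is worth seeing concretely. A minimal sketch (with assumed toy dimensions) verifying that folding $BA$ into the base weight reproduces the adapter path exactly, which is why merged LoRA has zero inference overhead:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, r = 64, 4
base = nn.Linear(d, d, bias=False)
lora_A = torch.randn(r, d) / d**0.5   # (r, d_in)
lora_B = torch.randn(d, r) * 0.01     # (d_out, r); zero at init, nonzero after training
scaling = 16.0 / r                    # alpha / rank

x = torch.randn(8, d)
# Adapter path: base output plus scaled low-rank perturbation
y_adapter = base(x) + (x @ lora_A.T @ lora_B.T) * scaling

# Merged path: fold BA into the base weight, run a plain linear layer
merged = nn.Linear(d, d, bias=False)
with torch.no_grad():
    merged.weight.copy_(base.weight + (lora_B @ lora_A) * scaling)
y_merged = merged(x)

print(torch.allclose(y_adapter, y_merged, atol=1e-5))  # True
```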
13.6 Two-Tower Models and Contrastive Learning for Retrieval
The concepts from this chapter — pretrained encoders, embedding pipelines, contrastive learning — converge in the two-tower model, the workhorse architecture for modern retrieval systems. This is the architecture that powers the candidate retrieval stage at companies like Google (YouTube), Meta (Instagram), Spotify, and Netflix.
Architecture
A two-tower model has two independent encoder networks:
- Query tower (user tower): Encodes the user's context (profile, history, current session) into a dense vector $\mathbf{q} \in \mathbb{R}^d$.
- Item tower (document tower): Encodes each item (metadata, description, features) into a dense vector $\mathbf{k} \in \mathbb{R}^d$.
The relevance score is the inner product (or cosine similarity) between query and item embeddings:
$$\text{score}(q, k) = \mathbf{q}^\top \mathbf{k}$$
The two-tower architecture has a critical deployment advantage: item embeddings can be precomputed and stored in a vector index (FAISS, ScaNN). At serving time, only the query tower runs, and retrieval is a nearest-neighbor search — sublinear in the number of items.
graph LR
subgraph "Query Tower"
U1["User Profile"] --> UE["Pretrained<br/>Transformer"]
U2["Watch History"] --> UE
UE --> UP["Projection"]
UP --> QV["q ∈ ℝ^d"]
end
subgraph "Item Tower"
I1["Title + Description"] --> IE["Pretrained<br/>Text Encoder"]
I2["Categories + Tags"] --> IE
IE --> IP["Projection"]
IP --> KV["k ∈ ℝ^d"]
end
QV -.->|"score = q · k"| KV
Contrastive Loss for Two-Tower Training
The two-tower model is trained with a contrastive loss. For each user $i$ in a batch, we have one positive item $k_i^+$ (the item the user actually engaged with) and $K$ negative items $\{k_j^-\}$ (items the user did not engage with). The loss is:
$$\mathcal{L}_{\text{contrast}} = -\sum_{i=1}^{N} \log \frac{\exp(\mathbf{q}_i^\top \mathbf{k}_i^+ / \tau)}{\exp(\mathbf{q}_i^\top \mathbf{k}_i^+ / \tau) + \sum_{j=1}^{K} \exp(\mathbf{q}_i^\top \mathbf{k}_j^- / \tau)}$$
This is the same InfoNCE loss (Oord et al., 2018) used in SimCLR and CLIP, applied to user-item pairs instead of image augmentations or image-text pairs.
Mathematical Foundation: The contrastive loss has a clean information-theoretic interpretation. Minimizing $\mathcal{L}_{\text{contrast}}$ is equivalent to maximizing a lower bound on the mutual information $I(\mathbf{q}; \mathbf{k}^+)$ between the query and positive item representations (Poole et al., 2019). The bound tightens with more negatives:
$$I(\mathbf{q}; \mathbf{k}^+) \geq \log K - \mathcal{L}_{\text{contrast}}$$
This explains why larger batch sizes (more in-batch negatives) improve contrastive learning performance: they tighten the mutual information bound. However, the bound saturates at $\log K$, so there are diminishing returns beyond a certain batch size.
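One practical consequence of the $\log K$ floor: with random (untrained) embeddings, the in-batch InfoNCE loss sits near $\log B$, so the mutual information bound starts near zero and only becomes positive as training pushes the loss below $\log B$. A sanity-check sketch with assumed values ($B = 256$, $d = 128$, $\tau = 0.07$):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, d, tau = 256, 128, 0.07
# Random unit vectors stand in for an untrained encoder's outputs
q = F.normalize(torch.randn(B, d), dim=-1)
k = F.normalize(torch.randn(B, d), dim=-1)
logits = q @ k.T / tau
loss = F.cross_entropy(logits, torch.arange(B))
print(f"loss = {loss:.2f}, log(B) = {math.log(B):.2f}")
# The bound log(B) - loss is roughly zero before training; it tightens
# only as the model learns to score positives above in-batch negatives.
```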
Negative Sampling Strategies
The choice of negatives dramatically affects what the model learns:
- In-batch negatives: Use other items in the same batch as negatives. Simple and efficient, but biased toward popular items (popular items appear in more batches and receive more negative signal).
- Hard negatives: Mine items that are similar to the positive but not identical — items the user viewed but did not engage with, or items from the same category. Hard negatives force the model to make finer distinctions.
- Mixed negatives: Combine random negatives (for broad coverage) with hard negatives (for discrimination).
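A minimal sketch of the mixed strategy, assuming a per-user pool of hard-negative item ids (e.g. viewed-but-not-completed items) and a catalog indexed 0..n_items-1; the function name and split are illustrative, not a standard API:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_negatives(
    hard_pool: np.ndarray,   # item ids the user viewed but did not engage with
    n_items: int,            # catalog size
    positive: int,           # the positive item id (must be excluded)
    k: int = 8,
    hard_fraction: float = 0.5,
) -> np.ndarray:
    """Sample k negatives: a hard_fraction share of hard negatives, rest random."""
    n_hard = min(int(k * hard_fraction), len(hard_pool))
    hard = rng.choice(hard_pool, size=n_hard, replace=False)
    # Random catalog negatives, excluding the positive and the hard picks
    forbidden = set(hard.tolist()) | {positive}
    random_neg: list[int] = []
    while len(random_neg) < k - n_hard:
        cand = int(rng.integers(n_items))
        if cand not in forbidden:
            random_neg.append(cand)
            forbidden.add(cand)
    return np.concatenate([hard, np.array(random_neg, dtype=hard.dtype)])

negs = sample_mixed_negatives(np.array([3, 7, 11, 19]), n_items=1000, positive=42, k=8)
print(negs.shape)  # (8,)
```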
import torch
import torch.nn as nn
import torch.nn.functional as F
class TwoTowerModel(nn.Module):
"""Two-tower model for user-item retrieval with contrastive learning.
Each tower independently encodes its input into a shared embedding
space. Training uses in-batch contrastive loss (InfoNCE) where all
other items in the batch serve as negatives for each user.
    Args:
        user_encoder: Pretrained encoder for user features. Assumed to
            output features of dimension embedding_dim (wrap it with an
            adapter layer if its native output dimension differs).
        item_encoder: Pretrained encoder for item features, with the
            same output-dimension assumption.
        embedding_dim: Dimensionality of the shared embedding space.
        temperature: Temperature parameter for contrastive loss.
"""
def __init__(
self,
user_encoder: nn.Module,
item_encoder: nn.Module,
embedding_dim: int = 128,
temperature: float = 0.07,
) -> None:
super().__init__()
self.user_encoder = user_encoder
self.item_encoder = item_encoder
self.temperature = temperature
# Projection heads to shared embedding space
# (These are always trained from scratch, even when encoders are frozen)
self.user_projection = nn.Sequential(
nn.Linear(embedding_dim, embedding_dim),
nn.ReLU(),
nn.Linear(embedding_dim, embedding_dim),
)
self.item_projection = nn.Sequential(
nn.Linear(embedding_dim, embedding_dim),
nn.ReLU(),
nn.Linear(embedding_dim, embedding_dim),
)
def encode_user(self, user_features: torch.Tensor) -> torch.Tensor:
"""Encode user features into the shared embedding space.
Args:
user_features: User feature tensor.
Returns:
L2-normalized user embedding.
"""
h = self.user_encoder(user_features)
z = self.user_projection(h)
return F.normalize(z, dim=-1)
def encode_item(self, item_features: torch.Tensor) -> torch.Tensor:
"""Encode item features into the shared embedding space.
Args:
item_features: Item feature tensor.
Returns:
L2-normalized item embedding.
"""
h = self.item_encoder(item_features)
z = self.item_projection(h)
return F.normalize(z, dim=-1)
def forward(
self,
user_features: torch.Tensor,
item_features: torch.Tensor,
) -> torch.Tensor:
"""Compute contrastive loss for a batch of user-item pairs.
Each (user_i, item_i) is a positive pair. All other items in the
batch serve as negatives for each user (in-batch negatives).
Args:
user_features: User features (batch_size, feature_dim).
item_features: Item features (batch_size, feature_dim).
Returns:
Scalar contrastive loss (InfoNCE).
"""
user_emb = self.encode_user(user_features) # (B, d)
item_emb = self.encode_item(item_features) # (B, d)
# Similarity matrix: (B, B)
# sim[i, j] = cosine_similarity(user_i, item_j) / temperature
logits = torch.mm(user_emb, item_emb.T) / self.temperature
# Labels: user_i should match item_i (diagonal)
labels = torch.arange(logits.size(0), device=logits.device)
# Symmetric loss: user→item and item→user
loss_u2i = F.cross_entropy(logits, labels)
loss_i2u = F.cross_entropy(logits.T, labels)
return (loss_u2i + loss_i2u) / 2.0
def train_two_tower(
model: TwoTowerModel,
user_features: torch.Tensor,
item_features: torch.Tensor,
epochs: int = 10,
batch_size: int = 256,
learning_rate: float = 1e-4,
) -> list[float]:
"""Train a two-tower model with contrastive learning.
Args:
model: TwoTowerModel instance.
user_features: All user feature vectors (n_interactions, user_dim).
item_features: Corresponding item feature vectors (n_interactions, item_dim).
epochs: Number of training epochs.
batch_size: Training batch size (larger = more negatives).
learning_rate: Learning rate.
Returns:
List of per-epoch average losses.
"""
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=epochs
)
dataset = torch.utils.data.TensorDataset(user_features, item_features)
loader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True, drop_last=True
)
epoch_losses = []
for epoch in range(epochs):
total_loss = 0.0
n_batches = 0
for user_batch, item_batch in loader:
optimizer.zero_grad()
loss = model(user_batch, item_batch)
loss.backward()
optimizer.step()
total_loss += loss.item()
n_batches += 1
scheduler.step()
avg_loss = total_loss / n_batches
epoch_losses.append(avg_loss)
print(f"Epoch {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}")
return epoch_losses
Evaluation Metrics for Retrieval
Two-tower models are evaluated with retrieval metrics, not classification metrics:
Recall@K: The fraction of relevant items that appear in the top-K retrieved items:
$$\text{Recall@}K = \frac{|\text{relevant items in top-}K|}{|\text{all relevant items}|}$$
Hit Rate@K (equivalent to Recall@K when each user has exactly one relevant item): Whether the relevant item appears in the top-K:
$$\text{HR@}K = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\text{relevant item}_i \in \text{top-}K_i]$$
Mean Reciprocal Rank (MRR): The average of the reciprocal of the rank of the first relevant item:
$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$
def evaluate_retrieval(
user_embeddings: torch.Tensor,
item_embeddings: torch.Tensor,
ground_truth: torch.Tensor,
    k_values: tuple[int, ...] = (1, 5, 10, 50, 100),
) -> dict[str, float]:
"""Evaluate retrieval model with standard metrics.
Computes the full similarity matrix and ranks items for each user.
Args:
user_embeddings: Query embeddings (n_users, d).
item_embeddings: Item embeddings (n_items, d).
ground_truth: True relevant item index for each user (n_users,).
k_values: List of K values for Recall@K and HR@K.
Returns:
Dictionary of metric_name → value.
"""
# Compute similarity matrix
similarity = torch.mm(user_embeddings, item_embeddings.T) # (n_users, n_items)
# Get ranks of ground truth items
sorted_indices = similarity.argsort(dim=1, descending=True)
ranks = torch.zeros(len(ground_truth), dtype=torch.long)
for i, gt in enumerate(ground_truth):
ranks[i] = (sorted_indices[i] == gt).nonzero(as_tuple=True)[0][0] + 1
metrics = {}
    for k in k_values:
        # With one relevant item per user, HR@K and Recall@K coincide.
        metrics[f"HR@{k}"] = (ranks <= k).float().mean().item()
        metrics[f"Recall@{k}"] = metrics[f"HR@{k}"]
metrics["MRR"] = (1.0 / ranks.float()).mean().item()
metrics["Median_Rank"] = ranks.float().median().item()
return metrics
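The per-user loop above is the clearest formulation, but a full argsort per user is unnecessary: the rank of the ground-truth item equals one plus the number of items that score strictly higher (assuming no ties). A vectorized sketch with synthetic embeddings:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_users, n_items, d = 100, 500, 32
user_emb = F.normalize(torch.randn(n_users, d), dim=-1)
item_emb = F.normalize(torch.randn(n_items, d), dim=-1)
ground_truth = torch.randint(0, n_items, (n_users,))

similarity = user_emb @ item_emb.T                            # (n_users, n_items)
gt_scores = similarity.gather(1, ground_truth.unsqueeze(1))   # (n_users, 1)
# Rank = 1 + number of items scoring strictly higher than the true item
ranks = 1 + (similarity > gt_scores).sum(dim=1)

hr10 = (ranks <= 10).float().mean().item()
mrr = (1.0 / ranks.float()).mean().item()
```

This avoids the O(n_items log n_items) sort per user and the Python-level loop.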
13.7 The HuggingFace Ecosystem
The practical infrastructure for transfer learning centers on HuggingFace, which has become the de facto standard for sharing and using pretrained models. Understanding the HuggingFace ecosystem is a professional skill for any practitioner working with deep learning.
Model Hub
The Model Hub hosts over 500,000 pretrained models across modalities (text, image, audio, video, multimodal). Each model page includes:
- Model card: Documentation of architecture, training data, intended use, limitations, and bias considerations.
- Inference API: Test the model directly in the browser.
- Download metrics: How many people use this model (a rough quality signal).
The transformers Library
The transformers library provides a unified API for loading and using pretrained models:
from transformers import AutoModel, AutoTokenizer, AutoConfig
# Load any model with the Auto API
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Inspect the model
config = AutoConfig.from_pretrained(model_name)
print(f"Hidden size: {config.hidden_size}") # 768
print(f"Num layers: {config.num_hidden_layers}") # 12
print(f"Num heads: {config.num_attention_heads}") # 12
print(f"Vocab size: {config.vocab_size}") # 30522
# Encode text
inputs = tokenizer(
"StreamRec recommends jazz documentaries",
return_tensors="pt",
padding=True,
truncation=True,
)
outputs = model(**inputs)
# outputs.last_hidden_state: (batch, seq_len, hidden_size)
# outputs.pooler_output: (batch, hidden_size) — CLS token representation
print(f"Sequence output shape: {outputs.last_hidden_state.shape}")
print(f"Pooled output shape: {outputs.pooler_output.shape}")
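A caveat on pooler_output: it is the CLS vector passed through a tanh layer trained for BERT's next-sentence objective, and for similarity tasks a masked mean over last_hidden_state is usually the better sentence representation. A self-contained sketch of the pooling step, using synthetic tensors with the shapes a BERT-base forward pass would produce:

```python
import torch

def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# Synthetic stand-ins for model outputs: batch 2, seq 4, hidden 768
hidden = torch.randn(2, 4, 768)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])    # trailing padding
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 768])
```

In practice you would pass outputs.last_hidden_state and the tokenizer's attention_mask to this function.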
The Trainer API
For fine-tuning, the Trainer class handles the training loop, evaluation, logging, checkpointing, and distributed training:
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
TrainingArguments,
Trainer,
)
from datasets import load_dataset
import numpy as np
def fine_tune_classifier(
model_name: str = "distilbert-base-uncased",
dataset_name: str = "imdb",
num_epochs: int = 3,
batch_size: int = 16,
learning_rate: float = 2e-5,
output_dir: str = "./results",
) -> Trainer:
"""Fine-tune a pretrained transformer for text classification.
Uses the HuggingFace Trainer API, which handles the training loop,
gradient accumulation, mixed precision, logging, and checkpointing.
Args:
model_name: Name of the pretrained model on HuggingFace Hub.
dataset_name: Name of the dataset on HuggingFace Hub.
num_epochs: Number of fine-tuning epochs.
batch_size: Per-device batch size.
learning_rate: Peak learning rate (with linear warmup).
output_dir: Directory for checkpoints and logs.
Returns:
Trained Trainer instance.
"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name, num_labels=2
)
dataset = load_dataset(dataset_name)
def tokenize(examples):
return tokenizer(
examples["text"],
padding="max_length",
truncation=True,
max_length=256,
)
tokenized = dataset.map(tokenize, batched=True)
def compute_metrics(eval_pred):
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
accuracy = (preds == labels).mean()
return {"accuracy": accuracy}
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_epochs,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size * 2,
learning_rate=learning_rate,
weight_decay=0.01,
warmup_ratio=0.1,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
fp16=True,
logging_steps=100,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
compute_metrics=compute_metrics,
)
trainer.train()
return trainer
Choosing a Pretrained Model
The abundance of pretrained models creates a new problem: how to choose. Here is a practical framework:
| Factor | What to Check | Why It Matters |
|---|---|---|
| Task alignment | Was the model pretrained on a similar task? | Closer pretraining tasks transfer better |
| Domain alignment | Was the model trained on data from your domain? | Domain-specific models (BioBERT, FinBERT) outperform general ones |
| Model size | Can you serve it within your latency budget? | Larger models are better but slower |
| License | Is the license compatible with your use case? | Apache 2.0 vs. research-only vs. commercial |
| Community adoption | Downloads, citations, leaderboard performance | Widely used models are better debugged |
| Recency | When was it trained? | Newer models generally outperform older ones |
Implementation Note: HuggingFace models cache downloads in `~/.cache/huggingface/`. On a shared server, set `HF_HOME` to a shared directory to avoid redundant downloads. For air-gapped environments, use `huggingface-cli download` to pre-download models, or use `model.save_pretrained()` and `model.from_pretrained(local_path)` to load from local disk.
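One ordering detail: the cache-related environment variables are read when the library is imported, so set them first. A sketch (the directory paths are placeholders):

```python
import os

# Set before `import transformers` / `import huggingface_hub`:
# the cache location is resolved at import time.
os.environ["HF_HOME"] = "/shared/hf-cache"   # shared cache directory (placeholder path)
os.environ["HF_HUB_OFFLINE"] = "1"           # air-gapped: fail fast instead of downloading

# Loading from a local snapshot saved earlier with model.save_pretrained(path):
# model = AutoModel.from_pretrained("/models/bert-base-uncased")  # local path, no network
```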
13.8 The Complete Modern DL Workflow
We now have all the pieces to describe the complete workflow that a senior deep learning practitioner follows for a new task.
Step 1: Problem Formulation
Define the task, the input/output format, and the evaluation metric. For StreamRec retrieval:
- Task: Given a user's profile and history, retrieve the top-K items most likely to be engaged with.
- Input: User features (profile, watch history). Item features (metadata, description).
- Output: Ranked list of items.
- Metric: Hit Rate@100, MRR.
Step 2: Baseline with Existing Pretrained Model
Start with a zero-shot or linear probe baseline. This takes hours, not weeks, and establishes whether your problem is already solved.
# Step 2a: Embed all items using a pretrained sentence transformer
# item_embeddings = build_item_embeddings(item_descriptions)
# Step 2b: Embed user queries as text ("user who watches jazz, cooking, tech")
# query_embeddings = build_item_embeddings(user_summaries)
# Step 2c: Nearest neighbor retrieval
# similarities = query_embeddings @ item_embeddings.T
# top_k = similarities.argsort(axis=1)[:, -100:][:, ::-1]
# Step 2d: Evaluate
# baseline_hr100 = hit_rate_at_k(top_k, ground_truth, k=100)
# print(f"Zero-shot HR@100: {baseline_hr100:.3f}")
Step 3: Fine-Tune or Adapt
If the baseline is promising but insufficient, fine-tune the pretrained model on your labeled data. For two-tower retrieval, this means training with contrastive loss on user-item engagement data.
Step 4: Evaluate Rigorously
Evaluate on a held-out test set that reflects production conditions:
- Temporal split: Train on data before time $t$, test on data after time $t$. Never random split for recommendation data — this leaks future information.
- Cold-start evaluation: Separately evaluate on new users and new items that were not in the training set.
- Fairness audit: Check retrieval quality across user demographics and content categories.
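The temporal split is simple to implement but easy to get wrong with a careless `train_test_split` call. A minimal sketch over interaction timestamps (the helper name is illustrative):

```python
import numpy as np

def temporal_split(timestamps: np.ndarray, train_fraction: float = 0.8):
    """Split interaction indices by time: interactions at or before the
    cutoff timestamp go to train, everything after to test."""
    cutoff = np.quantile(timestamps, train_fraction)
    train_idx = np.where(timestamps <= cutoff)[0]
    test_idx = np.where(timestamps > cutoff)[0]
    return train_idx, test_idx

ts = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
train_idx, test_idx = temporal_split(ts, train_fraction=0.8)
print(train_idx, test_idx)  # first 8 interactions train, last 2 test
```

The same cutoff must also be applied to any features derived from interactions (watch counts, embeddings), or the leak simply moves into feature engineering.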
Step 5: Deploy
The two-tower architecture enables a clean deployment pattern:
- Offline: Run the item tower on all items, store embeddings in a vector index (FAISS).
- Online: Run the user tower on the current user's features, retrieve top-K from the vector index.
- Refresh: Re-index items periodically (daily or weekly) or when new items are added.
import faiss
import numpy as np
def build_faiss_index(
item_embeddings: np.ndarray,
index_type: str = "IVFFlat",
nlist: int = 256,
) -> faiss.Index:
"""Build a FAISS index for fast approximate nearest neighbor search.
Args:
item_embeddings: Item embedding matrix (n_items, d).
index_type: Type of FAISS index.
nlist: Number of Voronoi cells for IVF index.
Returns:
Trained FAISS index.
"""
d = item_embeddings.shape[1]
item_embeddings = item_embeddings.astype(np.float32)
if index_type == "FlatIP":
# Exact inner product search (brute force)
index = faiss.IndexFlatIP(d)
elif index_type == "IVFFlat":
# Approximate search with inverted file index
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(item_embeddings)
index.nprobe = 32 # Search 32 of 256 cells
else:
raise ValueError(f"Unknown index type: {index_type}")
index.add(item_embeddings)
return index
def retrieve_top_k(
user_embedding: np.ndarray,
index: faiss.Index,
k: int = 100,
) -> tuple[np.ndarray, np.ndarray]:
"""Retrieve top-K items for a user using FAISS.
Args:
user_embedding: User query embedding (1, d) or (d,).
index: Trained FAISS index.
k: Number of items to retrieve.
Returns:
Tuple of (scores, item_indices), each of shape (1, k).
"""
if user_embedding.ndim == 1:
user_embedding = user_embedding.reshape(1, -1)
user_embedding = user_embedding.astype(np.float32)
scores, indices = index.search(user_embedding, k)
return scores, indices
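If faiss is not installed in your environment, the exact search that `IndexFlatIP` performs is a one-line matrix product; a numpy-only equivalent, useful as a correctness reference for small catalogs:

```python
import numpy as np

def brute_force_top_k(user_embedding: np.ndarray,
                      item_embeddings: np.ndarray,
                      k: int = 100) -> tuple[np.ndarray, np.ndarray]:
    """Exact inner-product retrieval: what IndexFlatIP computes."""
    if user_embedding.ndim == 1:
        user_embedding = user_embedding[None, :]
    scores = user_embedding @ item_embeddings.T      # (1, n_items)
    top = np.argsort(-scores, axis=1)[:, :k]         # descending order
    return np.take_along_axis(scores, top, axis=1), top

rng = np.random.default_rng(0)
items = rng.standard_normal((1000, 64)).astype(np.float32)
items /= np.linalg.norm(items, axis=1, keepdims=True)
query = items[7] + 0.01 * rng.standard_normal(64).astype(np.float32)
scores, indices = brute_force_top_k(query, items, k=5)
print(indices[0][0])  # 7 — the item the query was perturbed from
```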
Production Reality: At scale, the two-tower + FAISS pipeline handles millions of items with sub-10ms latency. The item tower runs offline (batch inference), so its compute cost is amortized across all users. The user tower must run online (per-request), so it must be fast — typically a single transformer forward pass. This is why two-tower models use relatively small encoders (BERT-base, not BERT-large) for the query tower in production: the latency constraint dominates.
13.9 StreamRec Progressive Project — Milestone M5
This milestone integrates the concepts from this chapter into the StreamRec recommendation system. You will build the two-tower retrieval model that becomes the candidate generation stage of the full pipeline.
Task
Build a two-tower retrieval model for StreamRec:
- User tower: Encode user profile and watch history using a pretrained transformer.
- Item tower: Encode item metadata (title, description, tags) using a pretrained sentence transformer.
- Training: Contrastive loss with in-batch negatives.
- Evaluation: HR@10, HR@100, MRR on a temporal test split.
- Deployment: Index item embeddings with FAISS for real-time retrieval.
Implementation Skeleton
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer
import numpy as np
class StreamRecUserTower(nn.Module):
    """User tower for StreamRec two-tower retrieval.

    Renders the user's profile and watch history as text, encodes it
    with a frozen pretrained transformer, mean-pools over non-padding
    tokens, and projects to the shared embedding space.

    Args:
        pretrained_model: Name of the pretrained transformer.
        embedding_dim: Output embedding dimensionality.
        max_history: Maximum number of history items rendered into the
            input text.
    """
def __init__(
self,
pretrained_model: str = "distilbert-base-uncased",
embedding_dim: int = 128,
max_history: int = 50,
) -> None:
        super().__init__()
        self.max_history = max_history  # used when rendering history to text
        self.encoder = AutoModel.from_pretrained(pretrained_model)
hidden_size = self.encoder.config.hidden_size # 768 for distilbert
# Freeze encoder, train projection
for param in self.encoder.parameters():
param.requires_grad = False
self.projection = nn.Sequential(
nn.Linear(hidden_size, 256),
nn.GELU(),
nn.Dropout(0.1),
nn.Linear(256, embedding_dim),
)
def forward(
self, input_ids: torch.Tensor, attention_mask: torch.Tensor
) -> torch.Tensor:
"""Encode user history into an embedding vector.
Args:
input_ids: Tokenized user history (batch, seq_len).
attention_mask: Attention mask (batch, seq_len).
Returns:
L2-normalized user embedding (batch, embedding_dim).
"""
with torch.no_grad():
outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
# Mean pooling over non-padding tokens
hidden = outputs.last_hidden_state # (batch, seq_len, hidden)
mask = attention_mask.unsqueeze(-1).float()
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
projected = self.projection(pooled)
return F.normalize(projected, dim=-1)
class StreamRecItemTower(nn.Module):
"""Item tower for StreamRec two-tower retrieval.
Encodes item title and description using a pretrained sentence
transformer, then projects to the shared embedding space.
Args:
sentence_model: Name of the sentence transformer model.
embedding_dim: Output embedding dimensionality.
"""
def __init__(
self,
sentence_model: str = "all-MiniLM-L6-v2",
embedding_dim: int = 128,
) -> None:
super().__init__()
self.encoder = SentenceTransformer(sentence_model)
encoder_dim = self.encoder.get_sentence_embedding_dimension() # 384
# Freeze encoder, train projection
for param in self.encoder.parameters():
param.requires_grad = False
self.projection = nn.Sequential(
nn.Linear(encoder_dim, 256),
nn.GELU(),
nn.Dropout(0.1),
nn.Linear(256, embedding_dim),
)
def forward(self, item_embeddings: torch.Tensor) -> torch.Tensor:
"""Project pre-computed item embeddings to the shared space.
In practice, item embeddings from the sentence transformer
are pre-computed offline. This forward pass only applies
the learned projection.
Args:
item_embeddings: Pre-encoded item features (batch, encoder_dim).
Returns:
L2-normalized item embedding (batch, embedding_dim).
"""
projected = self.projection(item_embeddings)
return F.normalize(projected, dim=-1)
class StreamRecRetrieval(nn.Module):
"""Two-tower retrieval model for StreamRec (Milestone M5).
Combines user and item towers with contrastive (InfoNCE) loss.
Designed for the candidate retrieval stage of the recommendation
pipeline, preceding the transformer ranking model from M4.
Args:
embedding_dim: Shared embedding space dimensionality.
temperature: Contrastive loss temperature.
"""
def __init__(
self, embedding_dim: int = 128, temperature: float = 0.05
) -> None:
super().__init__()
self.user_tower = StreamRecUserTower(embedding_dim=embedding_dim)
self.item_tower = StreamRecItemTower(embedding_dim=embedding_dim)
self.temperature = temperature
def contrastive_loss(
self,
user_emb: torch.Tensor,
item_emb: torch.Tensor,
) -> torch.Tensor:
"""Compute symmetric InfoNCE loss with in-batch negatives.
Args:
user_emb: Normalized user embeddings (B, d).
item_emb: Normalized item embeddings (B, d).
Returns:
Scalar contrastive loss.
"""
logits = torch.mm(user_emb, item_emb.T) / self.temperature
labels = torch.arange(logits.size(0), device=logits.device)
loss_u2i = F.cross_entropy(logits, labels)
loss_i2u = F.cross_entropy(logits.T, labels)
return (loss_u2i + loss_i2u) / 2.0
Track Guidance
- Track A (Minimal): Freeze both encoders, train only projection heads. Evaluate HR@100 on temporal test split. Target: HR@100 > 0.15.
- Track B (Standard): Unfreeze the user tower encoder with differential learning rates. Add hard negative mining (sample items the user viewed but did not complete). Target: HR@100 > 0.25.
- Track C (Full): Fine-tune both towers with progressive unfreezing. Implement mixed negatives (50% in-batch, 50% hard). Add cross-modal retrieval: use CLIP to embed item thumbnails alongside text, concatenate embeddings. Target: HR@100 > 0.30.
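Track B's "differential learning rates" are implemented with optimizer parameter groups: a small learning rate for the pretrained encoder, a larger one for the randomly initialized projection head. A sketch with toy stand-in modules (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the user tower's encoder and projection head
encoder = nn.Linear(768, 768)
projection = nn.Sequential(nn.Linear(768, 256), nn.GELU(), nn.Linear(256, 128))

# Differential learning rates via parameter groups: pretrained weights
# move slowly, the fresh projection head moves fast.
optimizer = torch.optim.AdamW(
    [
        {"params": encoder.parameters(), "lr": 1e-5},
        {"params": projection.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.01,
)
print([group["lr"] for group in optimizer.param_groups])  # [1e-05, 0.001]
```

Progressive unfreezing (Track C) extends the same mechanism: start with the encoder group's parameters frozen, then flip requires_grad on and lower its learning rate group by group as training stabilizes.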
Connection to Previous and Future Milestones
| Milestone | Component | Relation to M5 |
|---|---|---|
| M1 (Ch. 5) | LSH/FAISS ANN | The vector index infrastructure reused here |
| M3 (Ch. 8) | 1D CNN content embeddings | Replaced by pretrained sentence transformer embeddings |
| M4 (Ch. 10) | Transformer session model | Becomes the ranking model after two-tower retrieval |
| M5 (Ch. 13) | Two-tower retrieval | Candidate generation with pretrained encoders |
| M6 (Ch. 14) | GNN collaborative filtering | Alternative retrieval using graph structure |
13.10 Climate DL: Fine-Tuning a Vision Transformer for Satellite Imagery
The Climate Deep Learning anchor provides a concrete example of fine-tuning in a domain (satellite imagery) that is moderately distant from typical pretraining data (ImageNet).
The Problem
The Pacific Climate Research Consortium (PCRC) needs to classify satellite imagery into land-use categories (forest, cropland, urban, water, barren, wetland) to track deforestation, urbanization, and land degradation. They have 8,000 labeled satellite images — too few to train a vision transformer from scratch, but enough for effective fine-tuning.
The Approach
Fine-tune a ViT (Vision Transformer) pretrained on ImageNet-21k:
from transformers import ViTForImageClassification
import torch
import torch.nn as nn
def build_satellite_classifier(
num_classes: int = 6,
pretrained_model: str = "google/vit-base-patch16-224",
freeze_backbone: bool = False,
) -> ViTForImageClassification:
"""Build a satellite image classifier by fine-tuning a pretrained ViT.
Loads a ViT pretrained on ImageNet-21k and replaces the classification
head for the satellite land-use classification task.
Args:
num_classes: Number of land-use categories.
pretrained_model: HuggingFace model identifier.
freeze_backbone: If True, freeze all backbone layers (linear probe).
Returns:
ViT model ready for fine-tuning.
"""
model = ViTForImageClassification.from_pretrained(
pretrained_model,
num_labels=num_classes,
ignore_mismatched_sizes=True, # Replace head with correct size
)
if freeze_backbone:
for name, param in model.named_parameters():
if "classifier" not in name:
param.requires_grad = False
# Count parameters
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total:,} | Trainable: {trainable:,} ({100*trainable/total:.1f}%)")
return model
# Linear probe: only train the classification head
# model_probe = build_satellite_classifier(freeze_backbone=True)
# Total: 85,802,502 | Trainable: 4,614 (0.0%)
# Full fine-tuning: train everything
# model_ft = build_satellite_classifier(freeze_backbone=False)
# Total: 85,802,502 | Trainable: 85,802,502 (100.0%)
Domain Distance Analysis
Satellite imagery differs from ImageNet in several ways:
- Viewing angle: Top-down (nadir) vs. ground-level perspective.
- Color distribution: Vegetation indices (NDVI), false-color composites vs. natural color photographs.
- Texture patterns: Agricultural grids, forest canopy textures vs. animal fur, fabric patterns.
- Scale: A "building" in ImageNet fills the frame; in satellite imagery, a building is a few pixels.
Despite these differences, the early-layer features of ImageNet-pretrained models (edges, textures, color gradients) transfer reasonably well to satellite imagery. The later-layer features (object parts, scene layouts) transfer less well, which is why fine-tuning the full model outperforms a linear probe for this domain.
Research Insight: Neumann et al. (2019) and Manas et al. (2021) showed that self-supervised pretraining on unlabeled satellite imagery (using SimCLR or DINO) substantially outperforms ImageNet pretraining for remote sensing tasks. The key insight is that the pretraining domain matters as much as the pretraining method: features learned from satellite data are better priors for satellite tasks, even when the self-supervised pretraining uses no labels. If PCRC had access to a large corpus of unlabeled satellite images, self-supervised pretraining on that corpus followed by supervised fine-tuning on the 8,000 labeled images would be the strongest approach.
13.11 Putting It All Together
This chapter has covered the full landscape of modern deep learning practice — the workflow that professionals actually follow, rather than the train-from-scratch narrative that dominates textbooks. The core ideas:
- Pretrained models encode massive datasets as reusable features. Using them is not a shortcut — it is the engineering-correct approach that respects the economics of compute and data.
- The transfer learning spectrum (zero-shot → linear probe → fine-tuning → progressive unfreezing → train from scratch) maps to a decision framework based on labeled data, domain distance, and compute budget.
- Self-supervised learning (masked modeling, contrastive learning) is how modern pretrained models are built. Understanding the pretraining objective tells you what the model learned and what it did not.
- Foundation models (BERT, ViT, CLIP) have shifted the economics of deep learning: the cost of solving a new task has dropped from "train a model for weeks" to "fine-tune for hours."
- Two-tower models with contrastive learning are the standard architecture for large-scale retrieval, using pretrained encoders for both query and document towers.
- Parameter-efficient adaptation (LoRA, adapters, prompt tuning) enables fine-tuning models that are too large to fully train, making the benefits of scale accessible to practitioners without massive compute budgets.
- The HuggingFace ecosystem is the practical infrastructure that makes all of this work: model hub, tokenizers, Trainer API, and community-contributed models.
The next chapter (Graph Neural Networks) will extend deep learning to graph-structured data — the natural representation for user-item interactions, social networks, and molecular structures — and will provide a graph-based alternative to the two-tower retrieval model built here.
Simplest Model That Works: The theme of this chapter, stated plainly: before building anything complex, check whether someone has already trained a model that solves your problem. Before training from scratch, try fine-tuning. Before fine-tuning the full model, try a linear probe. The best engineers are not the ones who build the most — they are the ones who build only what needs to be built.