Chapter 13: Transfer Learning, Foundation Models, and the Modern Deep Learning Workflow

"Don't be a hero. Use pretrained models." — Andrej Karpathy, "A Recipe for Training Neural Networks" (2019)


Learning Objectives

By the end of this chapter, you will be able to:

  1. Apply transfer learning strategies (feature extraction, fine-tuning, progressive unfreezing) for different data regimes
  2. Select and adapt pretrained foundation models for domain-specific tasks
  3. Design embedding pipelines that leverage pretrained models for downstream retrieval and classification
  4. Evaluate when to train from scratch vs. fine-tune vs. use a foundation model as-is
  5. Implement a complete modern DL workflow: pretrained backbone → adapter/fine-tune → evaluation → deployment

13.1 The Paradigm Shift: From Training to Adapting

Every model we have built so far in this book started with random weights. The MLP in Chapter 6 initialized its parameters from a Gaussian distribution. The CNN in Chapter 8 used Kaiming initialization. The transformer in Chapter 10 began as a randomly wired attention machine. In each case, the model learned everything — from low-level features to task-specific decision boundaries — from the labeled data you provided.

This approach has a name: training from scratch. And for most practitioners, in most situations, it is the wrong approach.

Here is the dirty secret of modern deep learning: almost no one trains from scratch anymore. The standard workflow at companies deploying deep learning — from two-person startups to Google — is:

  1. Find a pretrained model that was trained on a large, general-purpose dataset.
  2. Adapt it to your specific domain and task.
  3. Deploy.

This chapter explains why this workflow dominates, formalizes the strategies for adaptation, and teaches you to make the critical decisions: when to use a pretrained model, which pretrained model, how much to adapt, and how to evaluate whether the adaptation worked.

Simplest Model That Works: Transfer learning is the most powerful instantiation of this theme. A pretrained model already encodes millions of dollars worth of compute and vast quantities of data. Using it is not laziness — it is engineering discipline. Training from scratch when a pretrained model exists is like writing your own database engine when PostgreSQL is available: technically impressive, practically unwise, and likely to produce an inferior result.

The Economics of Pretraining

The economic argument for transfer learning is stark. Consider the compute required to train models from scratch:

| Model | Parameters | Training Tokens/Images | Estimated Compute Cost |
|---|---|---|---|
| ResNet-50 | 25M | 1.2M images (ImageNet) | ~$50 (4 GPU-days) |
| BERT-base | 110M | 3.3B tokens | ~$5,000 (16 TPU-days) |
| ViT-Large | 307M | 14M images (ImageNet-21k) | ~$15,000 |
| GPT-3 | 175B | 300B tokens | ~$4,600,000 |
| Llama 3 70B | 70B | 15T tokens | ~$10,000,000+ |

Fine-tuning any of these models on your domain-specific data costs a small fraction of the pretraining budget: typically $10-$1,000 and a few hours on a single GPU. The pretrained model is a compressed representation of a massive dataset, and transfer learning lets you leverage that representation without paying the pretraining cost.

Why Transfer Learning Works: Feature Reuse

Transfer learning works because learned features are often general. The seminal insight came from Zeiler and Fergus (2014) and Yosinski et al. (2014), who visualized the features learned by CNNs trained on ImageNet:

  • Layer 1: Gabor-like edge detectors and color blobs — universal across all visual tasks.
  • Layer 2: Textures and simple patterns — still highly transferable.
  • Layer 3: Object parts and more complex patterns — partially transferable.
  • Layer 4+: Task-specific combinations — less transferable.

This hierarchy creates a transferability gradient: early layers learn general features that transfer well, while later layers learn task-specific features that may not. The same principle applies to language models, where early transformer layers learn syntactic patterns that transfer across tasks, while later layers encode more task-specific semantics.

Research Insight: Yosinski et al. (2014) quantified the transferability of features across layers by freezing various combinations of early and late layers. They found a striking result: even random features from a pretrained network outperformed learning from scratch with limited data. The pretrained network's features were not just good initializations — they represented a better region of the loss landscape that random initialization rarely finds.


13.2 The Transfer Learning Spectrum

Transfer learning is not a single technique but a spectrum of strategies. The right strategy depends on three factors:

  1. How much labeled data do you have? (100 examples vs. 10,000 vs. 1,000,000)
  2. How different is your domain from the pretraining domain? (natural images → medical images vs. natural images → satellite imagery)
  3. How much compute can you afford? (single GPU for an hour vs. 8 GPUs for a week)

We organize the spectrum from least to most adaptation:

graph LR
    A["Zero-Shot<br/>(No labeled data)"] --> B["Linear Probe<br/>(Frozen backbone + linear head)"]
    B --> C["Fine-Tuning<br/>(Unfreeze some/all layers)"]
    C --> D["Progressive Unfreezing<br/>(Gradual layer-by-layer)"]
    D --> E["Train from Scratch<br/>(Random initialization)"]
    style A fill:#e8f5e9
    style B fill:#c8e6c9
    style C fill:#a5d6a7
    style D fill:#81c784
    style E fill:#66bb6a

Strategy 1: Zero-Shot Inference

With zero-shot inference, you use a foundation model directly without any task-specific training. The model has learned such general representations during pretraining that it can handle your task out of the box.

When to use: No labeled data, or the task is well-represented in the pretraining distribution.

from transformers import pipeline

# Zero-shot text classification — no training required
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "StreamRec user complained about too many cooking videos in their feed",
    candidate_labels=["content relevance", "technical issue", "billing", "account"],
)
print(result["labels"][0], f"({result['scores'][0]:.3f})")
# content relevance (0.891)

The model was trained on natural language inference (NLI), not customer support classification. Yet it achieves reasonable accuracy because NLI — determining whether a premise entails a hypothesis — is a general enough capability that it transfers to classification when you frame each label as a hypothesis.

Strategy 2: Feature Extraction (Linear Probe)

Feature extraction freezes the pretrained backbone entirely and trains only a new classification head (typically a single linear layer). The backbone acts as a fixed feature extractor.

When to use: Small labeled dataset (100–1,000 examples), domain similar to pretraining data.

import torch
import torch.nn as nn
from torchvision import models, transforms
from torch.utils.data import DataLoader, TensorDataset
from typing import Tuple
import numpy as np


class LinearProbe(nn.Module):
    """Feature extraction with a frozen pretrained backbone.

    The backbone is frozen (no gradient computation), and only
    the linear classification head is trained. This is equivalent
    to using the backbone as a fixed feature extractor.

    Args:
        backbone: Pretrained model (e.g., resnet50).
        feature_dim: Dimensionality of backbone output features.
        num_classes: Number of target classes.
    """

    def __init__(
        self, backbone: nn.Module, feature_dim: int, num_classes: int
    ) -> None:
        super().__init__()
        self.backbone = backbone
        # Freeze all backbone parameters
        for param in self.backbone.parameters():
            param.requires_grad = False
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            features = self.backbone(x)
        return self.head(features)


def build_linear_probe(num_classes: int) -> LinearProbe:
    """Build a linear probe on top of a pretrained ResNet-50.

    Returns a model where only the final linear layer is trainable.
    The pretrained ResNet-50 backbone is used as a frozen feature extractor.

    Args:
        num_classes: Number of output classes.

    Returns:
        LinearProbe model ready for training.
    """
    # Load pretrained ResNet-50, remove its classification head
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    feature_dim = resnet.fc.in_features  # 2048
    resnet.fc = nn.Identity()  # Replace FC with identity

    return LinearProbe(resnet, feature_dim, num_classes)


# Usage
model = build_linear_probe(num_classes=10)

# Count trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Trainable fraction: {trainable_params / total_params:.4%}")
# Total parameters: 23,528,522
# Trainable parameters: 20,490
# Trainable fraction: 0.0871%

The linear probe has a remarkable property: because the backbone is frozen, it is convex in the head parameters. This means the optimal linear head can be found with ordinary least squares (or logistic regression for classification), and training is deterministic — no learning rate tuning, no local minima.
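Concretely, once features are extracted, the head can be fit with scikit-learn rather than SGD. A minimal sketch, with random matrices standing in for the 2048-dimensional frozen-backbone features (a real pipeline would first run the ResNet over the dataset to produce them):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for features from a frozen backbone: in practice these
# would be the 2048-d ResNet-50 outputs for each training image.
rng = np.random.default_rng(0)
n_samples, feature_dim, num_classes = 500, 2048, 10
features = rng.normal(size=(n_samples, feature_dim))
labels = rng.integers(0, num_classes, size=n_samples)

# Fitting the linear head is a convex problem: logistic regression
# converges to the same solution regardless of initialization.
head = LogisticRegression(max_iter=1000, C=1.0)
head.fit(features, labels)
print(head.coef_.shape)  # (10, 2048): one weight vector per class
```

Because no gradients flow through the backbone, the features can be computed once, cached to disk, and refit many times while sweeping the regularization strength C.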

Common Misconception: "If a linear probe works well, the pretrained features are good. If it doesn't, the pretrained features are bad." This is only half right. A linear probe tests whether the pretrained features are linearly separable for your task. The features might contain all the information needed, but in a form that requires nonlinear combination. In such cases, fine-tuning or a small nonlinear head will substantially outperform the linear probe, even though the underlying features are the same.

Strategy 3: Fine-Tuning

Fine-tuning unfreezes some or all of the pretrained backbone and trains it jointly with the new head, using a small learning rate to avoid destroying the pretrained representations.

When to use: Moderate labeled dataset (1,000–100,000 examples), domain partially different from pretraining data.

The key insight is learning rate differential: the pretrained layers should update more slowly than the new head, because the pretrained representations are already good and we want to refine them, not overwrite them.

from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


class FineTunedModel(nn.Module):
    """Fine-tuned pretrained model with differential learning rates.

    All backbone parameters are unfrozen and trained with a smaller
    learning rate than the new classification head.

    Args:
        backbone: Pretrained model.
        feature_dim: Dimensionality of backbone output features.
        num_classes: Number of target classes.
        dropout: Dropout probability before the classification head.
    """

    def __init__(
        self,
        backbone: nn.Module,
        feature_dim: int,
        num_classes: int,
        dropout: float = 0.2,
    ) -> None:
        super().__init__()
        self.backbone = backbone
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)
        features = self.dropout(features)
        return self.head(features)


def build_fine_tuned_model(num_classes: int) -> Tuple[FineTunedModel, AdamW]:
    """Build a fine-tuned ResNet-50 with differential learning rates.

    The backbone is trained with a 10x smaller learning rate than
    the classification head, preserving pretrained representations
    while allowing task-specific adaptation.

    Args:
        num_classes: Number of output classes.

    Returns:
        Tuple of (model, optimizer).
    """
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    feature_dim = resnet.fc.in_features
    resnet.fc = nn.Identity()

    model = FineTunedModel(resnet, feature_dim, num_classes)

    # Differential learning rates: backbone lr = head lr / 10
    head_lr = 1e-3
    backbone_lr = head_lr / 10

    optimizer = AdamW(
        [
            {"params": model.backbone.parameters(), "lr": backbone_lr},
            {"params": model.head.parameters(), "lr": head_lr},
        ],
        weight_decay=0.01,
    )

    return model, optimizer

Strategy 4: Progressive Unfreezing

Progressive unfreezing, introduced by Howard and Ruder (2018) in their ULMFiT paper, is a disciplined approach to fine-tuning that unfreezes layers one group at a time, from the last (most task-specific) to the first (most general).

When to use: Moderate to large labeled dataset, risk of catastrophic forgetting (domain very different from pretraining data).

The procedure:

  1. Freeze all pretrained layers. Train only the new head for 1-2 epochs.
  2. Unfreeze the last pretrained layer group. Train for 1-2 epochs with a lower learning rate.
  3. Unfreeze the next layer group. Repeat.
  4. Continue until all layers are unfrozen.

At each stage, the learning rate for newly unfrozen layers is smaller than for previously unfrozen layers, creating a learning rate schedule that mirrors the transferability gradient.

from typing import List, Dict, Any


def progressive_unfreeze(
    model: nn.Module,
    layer_groups: List[List[nn.Parameter]],
    base_lr: float = 1e-3,
    lr_decay_factor: float = 0.3,
) -> List[Dict[str, Any]]:
    """Create parameter groups for progressive unfreezing.

    Each layer group gets a learning rate that is lr_decay_factor
    times smaller than the next group. The last group (classification
    head) gets base_lr, the second-to-last gets base_lr * lr_decay_factor,
    and so on.

    This implements "discriminative learning rates" from ULMFiT
    (Howard and Ruder, 2018).

    Args:
        model: The model to configure.
        layer_groups: List of parameter groups, from earliest to latest.
        base_lr: Learning rate for the last (newest) group.
        lr_decay_factor: Multiplicative decay per group.

    Returns:
        Parameter groups suitable for an optimizer.
    """
    n_groups = len(layer_groups)
    param_groups = []

    for i, params in enumerate(layer_groups):
        # Earlier groups get smaller learning rates
        group_lr = base_lr * (lr_decay_factor ** (n_groups - 1 - i))
        param_groups.append({"params": params, "lr": group_lr})

    return param_groups


def create_resnet_layer_groups(model: FineTunedModel) -> List[List[nn.Parameter]]:
    """Split a ResNet-50 into layer groups for progressive unfreezing.

    Groups (from earliest to latest):
        0: conv1, bn1 (stem)
        1: layer1 (first residual block)
        2: layer2
        3: layer3
        4: layer4
        5: head (classification layer)

    Args:
        model: FineTunedModel wrapping a ResNet-50 backbone.

    Returns:
        List of parameter lists, one per layer group.
    """
    backbone = model.backbone
    groups = [
        list(backbone.conv1.parameters()) + list(backbone.bn1.parameters()),
        list(backbone.layer1.parameters()),
        list(backbone.layer2.parameters()),
        list(backbone.layer3.parameters()),
        list(backbone.layer4.parameters()),
        list(model.head.parameters()),
    ]
    return groups


# Example: progressive unfreezing with discriminative LR
# model, _ = build_fine_tuned_model(num_classes=10)
# layer_groups = create_resnet_layer_groups(model)
# param_groups = progressive_unfreeze(model, layer_groups, base_lr=1e-3)
# optimizer = AdamW(param_groups, weight_decay=0.01)
# Stage 1: Freeze groups 0-4, train only group 5 (head)
# Stage 2: Unfreeze group 4 (layer4), train groups 4-5
# Stage 3: Unfreeze group 3 (layer3), train groups 3-5
# ... continue until all groups are unfrozen

Strategy 5: Training from Scratch

When should you train from scratch? Almost never — but there are genuine cases:

  1. Your domain is radically different from any pretraining data. Scientific imaging modalities (electron microscopy, radio astronomy) may have no useful pretrained model.
  2. You have enormous amounts of labeled data. With millions of labeled examples in your domain, a custom architecture and training procedure may outperform a pretrained model that carries irrelevant inductive biases.
  3. Latency or model size constraints. If you need a model under 1MB for edge deployment, no pretrained foundation model will fit. A custom-designed small architecture trained from scratch may be the only option.
  4. Regulatory or IP constraints. Some organizations cannot use models trained on data whose provenance is unclear.

Production Reality: In industry, the decision is rarely pure. Many teams start with a pretrained model, fine-tune it to establish a strong baseline, and then — only after validating that the task justifies the investment — explore training a custom model from scratch. The pretrained baseline gives you a performance target and a deployment timeline. Starting from scratch without that baseline is premature optimization.

The Decision Framework

The following decision tree captures the practical heuristic that most senior practitioners use:

graph TD
    A["How much labeled data?"] -->|"< 100 examples"| B["Zero-shot or few-shot<br/>with foundation model"]
    A -->|"100 - 1,000"| C["Linear probe<br/>(frozen backbone)"]
    A -->|"1,000 - 100,000"| D{"Domain distance?"}
    A -->|"> 100,000"| E{"Compute budget?"}
    D -->|"Small"| F["Full fine-tuning<br/>(differential LR)"]
    D -->|"Large"| G["Progressive unfreezing"]
    E -->|"Limited"| F
    E -->|"Substantial"| H["Consider training<br/>from scratch"]

    style B fill:#e3f2fd
    style C fill:#e3f2fd
    style F fill:#e3f2fd
    style G fill:#e3f2fd
    style H fill:#fff3e0

This is a heuristic, not a law. But it captures the overwhelming empirical evidence: for most tasks, with most data budgets, pretrained models outperform training from scratch. The burden of proof is on training from scratch, not on transfer learning.


13.3 Domain Adaptation and Domain Shift

Transfer learning assumes that features learned on a source domain are useful in a target domain. When this assumption holds, transfer works beautifully. When it breaks — when the source and target distributions differ substantially — we face domain shift, and transfer can actually hurt performance.

Formalizing Domain Shift

Let $p_S(\mathbf{x}, y)$ be the source (pretraining) distribution and $p_T(\mathbf{x}, y)$ the target distribution. Domain shift occurs when $p_S \neq p_T$. The shift can decompose into several types:

| Type | Formal Condition | Example |
|---|---|---|
| Covariate shift | $p_S(\mathbf{x}) \neq p_T(\mathbf{x})$, $p_S(y \mid \mathbf{x}) = p_T(y \mid \mathbf{x})$ | ImageNet photos → satellite images (different input distribution, same labeling function) |
| Label shift | $p_S(y) \neq p_T(y)$, $p_S(\mathbf{x} \mid y) = p_T(\mathbf{x} \mid y)$ | Balanced training set → imbalanced deployment (different class frequencies) |
| Concept drift | $p_S(y \mid \mathbf{x}) \neq p_T(y \mid \mathbf{x})$ | User preferences change over time (same content, different engagement patterns) |

Measuring Domain Distance

How similar are two domains? Several metrics quantify domain distance, and the choice matters for predicting whether transfer will succeed:

Maximum Mean Discrepancy (MMD). For feature representations $\phi(\mathbf{x})$ mapped into a reproducing kernel Hilbert space $\mathcal{H}$:

$$\text{MMD}(\mathcal{D}_S, \mathcal{D}_T) = \left\| \frac{1}{|\mathcal{D}_S|} \sum_{\mathbf{x} \in \mathcal{D}_S} \phi(\mathbf{x}) - \frac{1}{|\mathcal{D}_T|} \sum_{\mathbf{x} \in \mathcal{D}_T} \phi(\mathbf{x}) \right\|_{\mathcal{H}}$$

MMD equals zero when the source and target feature distributions are identical. In practice, we compute MMD using a Gaussian kernel over the penultimate layer features of the pretrained model.
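In practice the norm in $\mathcal{H}$ is never computed directly: expanding the square turns the squared MMD into three kernel means. A minimal numpy sketch of that expansion (the unit bandwidth is an arbitrary choice for illustration; real pipelines often pick it with a median-distance heuristic):

```python
import numpy as np


def gaussian_mmd(source: np.ndarray, target: np.ndarray, bandwidth: float = 1.0) -> float:
    """Squared MMD with Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 bw^2)).

    Expanding the mean-embedding form gives:
        MMD^2 = mean k(s, s') + mean k(t, t') - 2 mean k(s, t).
    Uses the biased estimator (diagonal terms included) for brevity.
    """
    def kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * bandwidth**2))

    return float(
        kernel(source, source).mean()
        + kernel(target, target).mean()
        - 2.0 * kernel(source, target).mean()
    )


rng = np.random.default_rng(0)
same = rng.normal(size=(200, 8))
shifted = rng.normal(loc=2.0, size=(200, 8))
print(gaussian_mmd(same, same[::-1]))  # ~0: identical distributions
print(gaussian_mmd(same, shifted))     # positive: distributions differ
```

In a transfer setting, `source` and `target` would be the penultimate-layer features of the pretrained model on pretraining-like data and your domain data, respectively.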

Proxy A-Distance. Train a linear classifier to distinguish source from target examples. If the classifier achieves accuracy $A$, the proxy A-distance is $d_A = 2(1 - 2\epsilon)$ where $\epsilon = 1 - A$. If the classifier cannot distinguish the domains ($A \approx 0.5$), the domains are similar and transfer should work. If $A \approx 1.0$, the domains are very different.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def proxy_a_distance(
    source_features: np.ndarray, target_features: np.ndarray
) -> float:
    """Compute the Proxy A-Distance between source and target domains.

    Trains a logistic regression classifier to distinguish source from
    target features. High accuracy means the domains are very different
    (transfer may fail); accuracy near 0.5 means domains are similar
    (transfer should work).

    Reference: Ben-David et al., "A theory of learning from different
    domains" (Machine Learning, 2010).

    Args:
        source_features: Feature matrix from source domain (n_source, d).
        target_features: Feature matrix from target domain (n_target, d).

    Returns:
        Proxy A-distance (0 = identical domains, 2 = maximally different).
    """
    X = np.vstack([source_features, target_features])
    y = np.concatenate([
        np.zeros(len(source_features)),
        np.ones(len(target_features)),
    ])

    clf = LogisticRegression(max_iter=1000, C=1.0)
    accuracy = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    error = 1.0 - accuracy

    return 2.0 * (1.0 - 2.0 * error)

Mathematical Foundation: Ben-David et al. (2010) proved a generalization bound for domain adaptation that explicitly includes the domain distance:

$$\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H} \Delta \mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda^*$$

where $\epsilon_T(h)$ is the target error, $\epsilon_S(h)$ is the source error, $d_{\mathcal{H} \Delta \mathcal{H}}$ is the $\mathcal{H}$-divergence (related to proxy A-distance), and $\lambda^*$ is the error of the ideal joint hypothesis. This bound tells us that transfer learning can only work if (1) we have low error on the source, (2) the domains are close, and (3) there exists some hypothesis that works well on both. If any of these conditions fail, transfer is doomed.

Negative Transfer

When domain distance is large, fine-tuning a pretrained model can underperform training from scratch. This is called negative transfer, and it occurs when:

  • The pretrained features encode source-specific patterns that are actively misleading in the target domain.
  • The fine-tuning process cannot unlearn these patterns with the available target data.
  • The inductive biases from pretraining (e.g., translation invariance from CNNs trained on natural images) conflict with target domain structure.

The defense against negative transfer is monitoring: always compare your transferred model against a trained-from-scratch baseline (even a simple one) to verify that transfer is helping.


13.4 Self-Supervised Learning: How Pretrained Models Are Built

To use pretrained models effectively, you need to understand how they were trained. Most modern pretrained models use self-supervised learning — a training paradigm where the supervision signal comes from the data itself, not from human labels.

Masked Language Modeling (BERT)

BERT (Devlin et al., 2019) randomly masks 15% of input tokens and trains the model to predict them:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid \mathbf{x}_{\setminus \mathcal{M}})$$

where $\mathcal{M}$ is the set of masked positions and $\mathbf{x}_{\setminus \mathcal{M}}$ is the input with masked tokens replaced. The model must learn syntax, semantics, and world knowledge to fill in the blanks.
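The masked positions are the only ones that contribute to the loss, which is easy to express directly. A minimal PyTorch sketch with random logits standing in for a real encoder's predictions (no actual model is trained here):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len, mask_ratio = 1000, 32, 0.15

tokens = torch.randint(0, vocab_size, (seq_len,))  # original token ids x
mask = torch.rand(seq_len) < mask_ratio            # the set M (~15% of positions)
mask[0] = True                                     # ensure M is non-empty

# Stand-in for p_theta(x_i | x_without_M): a real model would produce these
# logits from the input with the masked tokens replaced by [MASK].
logits = torch.randn(seq_len, vocab_size)

# L_MLM = -sum_{i in M} log p_theta(x_i | ...): cross-entropy restricted
# to the masked positions only.
mlm_loss = F.cross_entropy(logits[mask], tokens[mask], reduction="sum")
print(f"{int(mask.sum())} masked positions, loss = {mlm_loss.item():.2f}")
```

HuggingFace implements the same restriction by setting unmasked labels to an ignore index, but the masked-position indexing above makes the sum over $\mathcal{M}$ explicit.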

Masked Image Modeling (MAE)

Masked autoencoders (He et al., 2022) apply the same idea to images: mask 75% of image patches and train a vision transformer to reconstruct the missing patches. The high masking ratio (compared to 15% for text) works because images have higher spatial redundancy — neighboring patches are more predictable than neighboring words.

Contrastive Learning (SimCLR, DINO)

Contrastive learning learns representations by pulling together different views of the same data point and pushing apart views of different data points.

SimCLR (Chen et al., 2020) creates two augmented views of each image and trains the model so that representations of the same image are similar while representations of different images are dissimilar. The loss function is the normalized temperature-scaled cross-entropy (NT-Xent):

$$\mathcal{L}_{\text{SimCLR}} = -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k) / \tau)}$$

where $\mathbf{z}_i$ and $\mathbf{z}_j$ are the representations of two augmented views of the same image, $\text{sim}(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top \mathbf{v} / (\|\mathbf{u}\| \|\mathbf{v}\|)$ is cosine similarity, $\tau$ is a temperature parameter, and the sum in the denominator is over all $2N$ views in the batch (both augmented versions of all $N$ images), excluding the anchor $i$ itself.
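The NT-Xent loss reduces to a cross-entropy over the similarity matrix once each view's positive is indexed. A minimal PyTorch sketch, assuming the 2N representations are stacked so that rows i and i+N come from the same image (a common but not universal batch layout):

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent over 2N views: rows 0..N-1 and N..2N-1 are the two
    augmented views of the same N images, so view i's positive is
    i+N (mod 2N)."""
    z = F.normalize(z, dim=1)             # unit norm: dot product = cosine sim
    sim = z @ z.T / temperature           # (2N, 2N) scaled similarity matrix
    n2 = z.shape[0]
    sim.fill_diagonal_(float("-inf"))     # the 1[k != i] indicator: drop anchor
    positives = torch.arange(n2).roll(n2 // 2)  # index of each view's positive
    # -log softmax at the positive index == the NT-Xent term for each anchor.
    return F.cross_entropy(sim, positives)


torch.manual_seed(0)
n = 8
views = torch.randn(2 * n, 128)  # stand-in for encoder outputs of 2N views
loss = nt_xent_loss(views)
print(f"NT-Xent loss: {loss.item():.3f}")
```

Setting the diagonal to negative infinity makes the anchor's own term vanish under softmax, which is exactly the $\mathbb{1}_{[k \neq i]}$ exclusion in the denominator.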

DINO (Caron et al., 2021) uses a self-distillation framework: a student network learns to match the output of a teacher network (an exponential moving average of the student), where teacher and student see different augmented views. DINO is notable because its attention maps spontaneously learn to segment objects — an emergent property that was not explicitly trained for.

Contrastive Language-Image Pretraining (CLIP)

CLIP (Radford et al., 2021) extends contrastive learning to paired image-text data. Given a batch of $N$ (image, text) pairs, CLIP learns image and text encoders that maximize the cosine similarity of matching pairs while minimizing it for non-matching pairs:

$$\mathcal{L}_{\text{CLIP}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{v}_i, \mathbf{t}_j) / \tau)} + \log \frac{\exp(\text{sim}(\mathbf{t}_i, \mathbf{v}_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{t}_i, \mathbf{v}_j) / \tau)} \right]$$

where $\mathbf{v}_i$ is the image embedding and $\mathbf{t}_i$ is the text embedding for the $i$-th pair. The first term pulls image $i$ toward its correct text and away from all other texts; the second does the same from the text side.

CLIP's power comes from its scale (400M image-text pairs from the internet) and its generality: the learned embeddings support zero-shot classification by computing similarity between an image embedding and text embeddings of class descriptions ("a photo of a dog", "a photo of a cat").

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch


def clip_zero_shot_classify(
    image_path: str, candidate_labels: list[str]
) -> dict[str, float]:
    """Zero-shot image classification using CLIP.

    Computes cosine similarity between the image embedding and
    text embeddings of each candidate label, returning normalized
    probabilities.

    Args:
        image_path: Path to the image file.
        candidate_labels: List of class descriptions.

    Returns:
        Dictionary mapping labels to probabilities.
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open(image_path)
    prompts = [f"a photo of {label}" for label in candidate_labels]

    inputs = processor(
        text=prompts, images=image, return_tensors="pt", padding=True
    )
    outputs = model(**inputs)

    # Cosine similarity → softmax → probabilities
    logits_per_image = outputs.logits_per_image  # (1, n_labels)
    probs = logits_per_image.softmax(dim=1).squeeze()

    return {
        label: prob.item()
        for label, prob in zip(candidate_labels, probs)
    }

Research Insight: CLIP's training objective is symmetric — it can be read as "for each image, find its matching text" or "for each text, find its matching image." This symmetry makes it a natural fit for retrieval systems, where you need to score query-document pairs in both directions. The two-tower architecture we build for StreamRec (Section 13.6) uses the same principle: one tower for queries (users), one tower for documents (items), trained with contrastive loss.


13.5 Foundation Models and the New Workflow

The term foundation model (Bommasani et al., 2021) refers to a model trained on broad data that can be adapted to a wide range of downstream tasks. The defining characteristics are:

  1. Scale: Trained on massive datasets (billions of tokens or images).
  2. Generality: Representations transfer across many tasks without architectural modification.
  3. Emergence: Capabilities that were not explicitly trained for appear at sufficient scale (e.g., CLIP learning object segmentation, LLMs learning arithmetic).

Sentence Embeddings and Embedding Pipelines

For many production systems, the primary use of pretrained models is not classification or generation but embedding — mapping inputs to dense vector representations that can be used for retrieval, clustering, and similarity search.

The standard embedding pipeline:

graph LR
    A["Raw Input<br/>(text, image, audio)"] --> B["Pretrained Encoder<br/>(BERT, ViT, CLIP)"]
    B --> C["Pooling<br/>(CLS token, mean)"]
    C --> D["Embedding Vector<br/>(d = 384-1024)"]
    D --> E["Vector Database<br/>(FAISS, Pinecone)"]
    E --> F["Downstream Task<br/>(retrieval, clustering)"]

For text, the dominant approach uses sentence transformers (Reimers and Gurevych, 2019): BERT-like models fine-tuned with a contrastive objective so that similar sentences have similar embeddings.

from sentence_transformers import SentenceTransformer
import numpy as np


def build_item_embeddings(
    descriptions: list[str],
    model_name: str = "all-MiniLM-L6-v2",
    batch_size: int = 64,
) -> np.ndarray:
    """Encode item descriptions into dense embedding vectors.

    Uses a pretrained sentence transformer to generate embeddings
    that capture semantic similarity. Items with similar descriptions
    will have high cosine similarity in the embedding space.

    Args:
        descriptions: List of item description strings.
        model_name: Name of the sentence transformer model.
        batch_size: Encoding batch size.

    Returns:
        Embedding matrix of shape (n_items, embedding_dim).
    """
    model = SentenceTransformer(model_name)
    embeddings = model.encode(
        descriptions,
        batch_size=batch_size,
        show_progress_bar=True,
        normalize_embeddings=True,  # Unit norm for cosine similarity
    )
    return embeddings


# StreamRec: embed item catalog
# item_descriptions = [
#     "Documentary about the history of jazz music in New Orleans",
#     "Tutorial on advanced Python decorators and metaclasses",
#     "Cooking show featuring traditional Japanese ramen recipes",
# ]
# embeddings = build_item_embeddings(item_descriptions)
# print(embeddings.shape)  # (3, 384)

Adapters and Parameter-Efficient Fine-Tuning

Fine-tuning an entire foundation model is expensive in memory and storage. If you have 10 downstream tasks, do you need 10 copies of GPT-3? Adapters (Houlsby et al., 2019) solve this by inserting small trainable modules into the frozen pretrained model:

$$\mathbf{h} \leftarrow \mathbf{h} + f(\mathbf{h} \, W_{\text{down}}) \, W_{\text{up}}$$

where $W_{\text{down}} \in \mathbb{R}^{d \times r}$ projects the hidden state to a low-rank bottleneck of dimension $r \ll d$, $f$ is a nonlinearity, and $W_{\text{up}} \in \mathbb{R}^{r \times d}$ projects back. Only $W_{\text{down}}$ and $W_{\text{up}}$ are trained; the original model weights are frozen.

LoRA (Hu et al., 2022) takes a different approach: instead of adding new modules, it adds a low-rank perturbation to existing weight matrices:

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$. The rank $r$ controls the tradeoff between expressiveness and parameter efficiency.
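To make the savings concrete: for a hidden size typical of BERT-base ($d = 768$) and rank $r = 8$, the LoRA factors hold about 2% of the parameters of the full matrix they perturb:

```python
d, r = 768, 8                # hidden size, LoRA rank
full = d * d                 # parameters in one full d x d weight matrix
lora = 2 * d * r             # parameters in B (d x r) and A (r x d) combined
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 589,824  lora: 12,288  ratio: 2.08%
```

And this counts only the matrices LoRA actually targets; relative to the whole model, the trainable fraction is far smaller still.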

import torch.nn as nn
import torch
import math


class LoRALinear(nn.Module):
    """Linear layer with LoRA (Low-Rank Adaptation).

    Adds a low-rank perturbation BA to the frozen pretrained weight W,
    so the effective weight is W + BA. Only B and A are trained.

    Reference: Hu et al., "LoRA: Low-Rank Adaptation of Large Language
    Models" (ICLR, 2022).

    Args:
        pretrained_linear: Original frozen linear layer.
        rank: Rank of the LoRA decomposition.
        alpha: Scaling factor; the low-rank update is multiplied by alpha / rank.
    """

    def __init__(
        self, pretrained_linear: nn.Linear, rank: int = 8, alpha: float = 16.0
    ) -> None:
        super().__init__()
        self.pretrained = pretrained_linear
        self.pretrained.weight.requires_grad = False
        if self.pretrained.bias is not None:
            self.pretrained.bias.requires_grad = False

        d_out, d_in = pretrained_linear.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) / math.sqrt(d_in))
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output + low-rank perturbation
        base_output = self.pretrained(x)
        lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base_output + lora_output


def apply_lora_to_model(
    model: nn.Module,
    target_modules: list[str],
    rank: int = 8,
    alpha: float = 16.0,
) -> nn.Module:
    """Apply LoRA to specified linear layers in a model.

    Replaces each target linear layer with a LoRALinear wrapper,
    freezing the original weights and adding trainable low-rank
    perturbation matrices.

    Args:
        model: The pretrained model.
        target_modules: Names of modules to apply LoRA to.
        rank: LoRA rank.
        alpha: LoRA scaling factor.

    Returns:
        Modified model with LoRA layers.
    """
    # Collect target layer names first: replacing modules while iterating
    # named_modules() would descend into the new LoRALinear wrappers and
    # wrap their inner nn.Linear a second time.
    target_names = [
        name
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
        and any(target in name for target in target_modules)
    ]
    for name in target_names:
        module = model.get_submodule(name)
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, LoRALinear(module, rank, alpha))

    return model

Prompt tuning (Lester et al., 2021) takes parameter efficiency further: instead of modifying the model at all, it prepends a set of trainable "soft prompt" vectors to the input. The model is entirely frozen; only the prompt vectors (typically 10-100 tokens) are optimized. This is the most parameter-efficient adaptation method, but it works best for large language models where the prompt can steer behavior through the model's existing capabilities.
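Mechanically, a soft prompt is just a trainable matrix concatenated in front of the input embeddings. A hedged sketch (the class name and initialization scale are our choices):

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Trainable soft prompt in the style of Lester et al. (2021).

    Learns n_prompt continuous "token" vectors that are prepended to
    the frozen model's input embeddings. Only these vectors receive
    gradients; the model itself is untouched.
    """

    def __init__(self, n_prompt: int, d_model: int) -> None:
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_model)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # (batch, n_prompt + seq_len, d_model)
```

Note that the attention mask must be extended by n_prompt positions to match the lengthened sequence.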

| Method | Trainable Parameters | Storage per Task | Inference Overhead |
|---|---|---|---|
| Full fine-tuning | 100% | Full model copy | None |
| Adapter | ~2-4% | Adapter weights only | Small forward pass overhead |
| LoRA | ~0.1-1% | LoRA matrices only | None (merge at deployment) |
| Prompt tuning | ~0.01% | Prompt vectors only | None |

Production Reality: LoRA has become the default fine-tuning method in industry for models above ~1B parameters. The reason is practical: LoRA weights can be merged into the base model weights at deployment time ($W' = W + BA$), so there is zero inference overhead. This means you can serve one base model and swap task-specific LoRA adapters in and out with negligible latency cost. Multiple LoRA adapters can even be served simultaneously by batching requests for different tasks through the same base model.
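The deployment-time merge is a small weight update. The sketch below (function name is ours) takes the LoRA factors explicitly so it stands alone; the math mirrors the LoRALinear forward pass above:

```python
import torch
import torch.nn as nn


def merge_lora(
    pretrained: nn.Linear,
    lora_A: torch.Tensor,  # (rank, d_in)
    lora_B: torch.Tensor,  # (d_out, rank)
    scaling: float,
) -> nn.Linear:
    """Fold the low-rank update into the base weight: W' = W + scaling * B @ A.

    The returned plain nn.Linear computes the same function as the
    LoRA-wrapped layer, so inference pays no extra cost.
    """
    merged = nn.Linear(
        pretrained.in_features,
        pretrained.out_features,
        bias=pretrained.bias is not None,
    )
    with torch.no_grad():
        merged.weight.copy_(pretrained.weight + scaling * (lora_B @ lora_A))
        if pretrained.bias is not None:
            merged.bias.copy_(pretrained.bias)
    return merged
```

Keeping the unmerged factors on disk lets you "un-merge" later (subtract the same update) or swap in a different task's adapter.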


13.6 Two-Tower Models and Contrastive Learning for Retrieval

The concepts from this chapter — pretrained encoders, embedding pipelines, contrastive learning — converge in the two-tower model, the workhorse architecture for modern retrieval systems. This is the architecture that powers the candidate retrieval stage at companies like Google (YouTube), Meta (Instagram), Spotify, and Netflix.

Architecture

A two-tower model has two independent encoder networks:

  • Query tower (user tower): Encodes the user's context (profile, history, current session) into a dense vector $\mathbf{q} \in \mathbb{R}^d$.
  • Item tower (document tower): Encodes each item (metadata, description, features) into a dense vector $\mathbf{k} \in \mathbb{R}^d$.

The relevance score is the inner product (or cosine similarity) between query and item embeddings:

$$\text{score}(q, k) = \mathbf{q}^\top \mathbf{k}$$

The two-tower architecture has a critical deployment advantage: item embeddings can be precomputed and stored in a vector index (FAISS, ScaNN). At serving time, only the query tower runs, and retrieval is a nearest-neighbor search — sublinear in the number of items.

graph LR
    subgraph "Query Tower"
        U1["User Profile"] --> UE["Pretrained<br/>Transformer"]
        U2["Watch History"] --> UE
        UE --> UP["Projection"]
        UP --> QV["q ∈ ℝ^d"]
    end
    subgraph "Item Tower"
        I1["Title + Description"] --> IE["Pretrained<br/>Text Encoder"]
        I2["Categories + Tags"] --> IE
        IE --> IP["Projection"]
        IP --> KV["k ∈ ℝ^d"]
    end
    QV -.->|"score = q · k"| KV

Contrastive Loss for Two-Tower Training

The two-tower model is trained with a contrastive loss. For each user $i$ in a batch, we have one positive item $k_i^+$ (the item the user actually engaged with) and $K$ negative items $\{k_j^-\}$ (items the user did not engage with). The loss is:

$$\mathcal{L}_{\text{contrast}} = -\sum_{i=1}^{N} \log \frac{\exp(\mathbf{q}_i^\top \mathbf{k}_i^+ / \tau)}{\exp(\mathbf{q}_i^\top \mathbf{k}_i^+ / \tau) + \sum_{j=1}^{K} \exp(\mathbf{q}_i^\top \mathbf{k}_j^- / \tau)}$$

This is the same InfoNCE loss (Oord et al., 2018) used in SimCLR and CLIP, applied to user-item pairs instead of image augmentations or image-text pairs.

Mathematical Foundation: The contrastive loss has a clean information-theoretic interpretation. Minimizing $\mathcal{L}_{\text{contrast}}$ is equivalent to maximizing a lower bound on the mutual information $I(\mathbf{q}; \mathbf{k}^+)$ between the query and positive item representations (Poole et al., 2019). The bound tightens with more negatives:

$$I(\mathbf{q}; \mathbf{k}^+) \geq \log K - \mathcal{L}_{\text{contrast}}$$

This explains why larger batch sizes (more in-batch negatives) improve contrastive learning performance: they tighten the mutual information bound. However, the bound saturates at $\log K$, so there are diminishing returns beyond a certain batch size.
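The $\log K$ ceiling is worth internalizing numerically: doubling the negatives raises the certifiable mutual information by only $\log 2 \approx 0.69$ nats, regardless of how large $K$ already is.

```python
import math

# The InfoNCE mutual-information bound is capped at log K: even a
# perfect model cannot certify more than log K nats of MI.
for K in (64, 256, 1024, 4096):
    print(f"K={K:5d}  log K = {math.log(K):.2f} nats")
# K=   64  log K = 4.16 nats
# K=  256  log K = 5.55 nats
# K= 1024  log K = 6.93 nats
# K= 4096  log K = 8.32 nats
```

Quadrupling the batch from 1024 to 4096 negatives buys only about 1.39 extra nats of headroom, which is why batch-size gains flatten out.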

Negative Sampling Strategies

The choice of negatives dramatically affects what the model learns:

  1. In-batch negatives: Use other items in the same batch as negatives. Simple and efficient, but biased toward popular items (popular items appear in more batches and receive more negative signal).
  2. Hard negatives: Mine items that are similar to the positive but not identical — items the user viewed but did not engage with, or items from the same category. Hard negatives force the model to make finer distinctions.
  3. Mixed negatives: Combine random negatives (for broad coverage) with hard negatives (for discrimination).
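Hard negatives (strategy 2) can be mined offline against a current snapshot of the embeddings. A minimal sketch (function and argument names are ours; production systems typically mine from engagement logs such as viewed-but-skipped items rather than pure similarity):

```python
import torch


def mine_hard_negatives(
    user_emb: torch.Tensor,      # (n_users, d), L2-normalized
    item_emb: torch.Tensor,      # (n_items, d), L2-normalized
    positive_ids: torch.Tensor,  # (n_users,) index of each user's positive item
    n_hard: int = 4,
) -> torch.Tensor:
    """Pick the highest-scoring non-positive items as hard negatives."""
    scores = user_emb @ item_emb.T                                # (n_users, n_items)
    # Mask out each user's positive so it cannot be selected as a negative
    scores.scatter_(1, positive_ids.unsqueeze(1), float("-inf"))
    return scores.topk(n_hard, dim=1).indices                     # (n_users, n_hard)
```

The mined indices are then mixed into training batches alongside the in-batch negatives, and the mining pass is rerun periodically as the towers improve.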

import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoTowerModel(nn.Module):
    """Two-tower model for user-item retrieval with contrastive learning.

    Each tower independently encodes its input into a shared embedding
    space. Training uses in-batch contrastive loss (InfoNCE) where all
    other items in the batch serve as negatives for each user.

    Args:
        user_encoder: Pretrained encoder for user features.
        item_encoder: Pretrained encoder for item features.
        embedding_dim: Dimensionality of the shared embedding space.
        temperature: Temperature parameter for contrastive loss.
    """

    def __init__(
        self,
        user_encoder: nn.Module,
        item_encoder: nn.Module,
        embedding_dim: int = 128,
        temperature: float = 0.07,
    ) -> None:
        super().__init__()
        self.user_encoder = user_encoder
        self.item_encoder = item_encoder
        self.temperature = temperature

        # Projection heads to shared embedding space
        # (These are always trained from scratch, even when encoders are frozen)
        self.user_projection = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim),
        )
        self.item_projection = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim),
        )

    def encode_user(self, user_features: torch.Tensor) -> torch.Tensor:
        """Encode user features into the shared embedding space.

        Args:
            user_features: User feature tensor.

        Returns:
            L2-normalized user embedding.
        """
        h = self.user_encoder(user_features)
        z = self.user_projection(h)
        return F.normalize(z, dim=-1)

    def encode_item(self, item_features: torch.Tensor) -> torch.Tensor:
        """Encode item features into the shared embedding space.

        Args:
            item_features: Item feature tensor.

        Returns:
            L2-normalized item embedding.
        """
        h = self.item_encoder(item_features)
        z = self.item_projection(h)
        return F.normalize(z, dim=-1)

    def forward(
        self,
        user_features: torch.Tensor,
        item_features: torch.Tensor,
    ) -> torch.Tensor:
        """Compute contrastive loss for a batch of user-item pairs.

        Each (user_i, item_i) is a positive pair. All other items in the
        batch serve as negatives for each user (in-batch negatives).

        Args:
            user_features: User features (batch_size, feature_dim).
            item_features: Item features (batch_size, feature_dim).

        Returns:
            Scalar contrastive loss (InfoNCE).
        """
        user_emb = self.encode_user(user_features)   # (B, d)
        item_emb = self.encode_item(item_features)    # (B, d)

        # Similarity matrix: (B, B)
        # sim[i, j] = cosine_similarity(user_i, item_j) / temperature
        logits = torch.mm(user_emb, item_emb.T) / self.temperature

        # Labels: user_i should match item_i (diagonal)
        labels = torch.arange(logits.size(0), device=logits.device)

        # Symmetric loss: user→item and item→user
        loss_u2i = F.cross_entropy(logits, labels)
        loss_i2u = F.cross_entropy(logits.T, labels)

        return (loss_u2i + loss_i2u) / 2.0


def train_two_tower(
    model: TwoTowerModel,
    user_features: torch.Tensor,
    item_features: torch.Tensor,
    epochs: int = 10,
    batch_size: int = 256,
    learning_rate: float = 1e-4,
) -> list[float]:
    """Train a two-tower model with contrastive learning.

    Args:
        model: TwoTowerModel instance.
        user_features: All user feature vectors (n_interactions, user_dim).
        item_features: Corresponding item feature vectors (n_interactions, item_dim).
        epochs: Number of training epochs.
        batch_size: Training batch size (larger = more negatives).
        learning_rate: Learning rate.

    Returns:
        List of per-epoch average losses.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs
    )

    dataset = torch.utils.data.TensorDataset(user_features, item_features)
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=True, drop_last=True
    )

    epoch_losses = []
    for epoch in range(epochs):
        total_loss = 0.0
        n_batches = 0
        for user_batch, item_batch in loader:
            optimizer.zero_grad()
            loss = model(user_batch, item_batch)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            n_batches += 1

        scheduler.step()
        avg_loss = total_loss / n_batches
        epoch_losses.append(avg_loss)
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}")

    return epoch_losses

Evaluation Metrics for Retrieval

Two-tower models are evaluated with retrieval metrics, not classification metrics:

Recall@K: The fraction of relevant items that appear in the top-K retrieved items:

$$\text{Recall@}K = \frac{|\text{relevant items in top-}K|}{|\text{all relevant items}|}$$

Hit Rate@K (equivalent to Recall@K when each user has exactly one relevant item): Whether the relevant item appears in the top-K:

$$\text{HR@}K = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\text{relevant item}_i \in \text{top-}K_i]$$

Mean Reciprocal Rank (MRR): The average of the reciprocal of the rank of the first relevant item:

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$

def evaluate_retrieval(
    user_embeddings: torch.Tensor,
    item_embeddings: torch.Tensor,
    ground_truth: torch.Tensor,
    k_values: list[int] = [1, 5, 10, 50, 100],
) -> dict[str, float]:
    """Evaluate retrieval model with standard metrics.

    Computes the full similarity matrix and ranks items for each user.

    Args:
        user_embeddings: Query embeddings (n_users, d).
        item_embeddings: Item embeddings (n_items, d).
        ground_truth: True relevant item index for each user (n_users,).
        k_values: List of K values for Recall@K and HR@K.

    Returns:
        Dictionary of metric_name → value.
    """
    # Compute similarity matrix
    similarity = torch.mm(user_embeddings, item_embeddings.T)  # (n_users, n_items)

    # Get ranks of ground truth items
    sorted_indices = similarity.argsort(dim=1, descending=True)
    ranks = torch.zeros(len(ground_truth), dtype=torch.long)
    for i, gt in enumerate(ground_truth):
        ranks[i] = (sorted_indices[i] == gt).nonzero(as_tuple=True)[0][0] + 1

    metrics = {}
    for k in k_values:
        # With one relevant item per user, HR@K and Recall@K coincide
        hit_at_k = (ranks <= k).float().mean().item()
        metrics[f"HR@{k}"] = hit_at_k
        metrics[f"Recall@{k}"] = hit_at_k

    metrics["MRR"] = (1.0 / ranks.float()).mean().item()
    metrics["Median_Rank"] = ranks.float().median().item()

    return metrics

13.7 The HuggingFace Ecosystem

The practical infrastructure for transfer learning centers on HuggingFace, which has become the de facto standard for sharing and using pretrained models. Understanding the HuggingFace ecosystem is a professional skill for any practitioner working with deep learning.

Model Hub

The Model Hub hosts over 500,000 pretrained models across modalities (text, image, audio, video, multimodal). Each model page includes:

  • Model card: Documentation of architecture, training data, intended use, limitations, and bias considerations.
  • Inference API: Test the model directly in the browser.
  • Download metrics: How many people use this model (a rough quality signal).

The transformers Library

The transformers library provides a unified API for loading and using pretrained models:

from transformers import AutoModel, AutoTokenizer, AutoConfig

# Load any model with the Auto API
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Inspect the model
config = AutoConfig.from_pretrained(model_name)
print(f"Hidden size: {config.hidden_size}")          # 768
print(f"Num layers: {config.num_hidden_layers}")     # 12
print(f"Num heads: {config.num_attention_heads}")    # 12
print(f"Vocab size: {config.vocab_size}")             # 30522

# Encode text
inputs = tokenizer(
    "StreamRec recommends jazz documentaries",
    return_tensors="pt",
    padding=True,
    truncation=True,
)
outputs = model(**inputs)

# outputs.last_hidden_state: (batch, seq_len, hidden_size)
# outputs.pooler_output: (batch, hidden_size), the [CLS] token passed through a learned pooler (Linear + tanh)
print(f"Sequence output shape: {outputs.last_hidden_state.shape}")
print(f"Pooled output shape: {outputs.pooler_output.shape}")

The Trainer API

For fine-tuning, the Trainer class handles the training loop, evaluation, logging, checkpointing, and distributed training:

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import numpy as np


def fine_tune_classifier(
    model_name: str = "distilbert-base-uncased",
    dataset_name: str = "imdb",
    num_epochs: int = 3,
    batch_size: int = 16,
    learning_rate: float = 2e-5,
    output_dir: str = "./results",
) -> Trainer:
    """Fine-tune a pretrained transformer for text classification.

    Uses the HuggingFace Trainer API, which handles the training loop,
    gradient accumulation, mixed precision, logging, and checkpointing.

    Args:
        model_name: Name of the pretrained model on HuggingFace Hub.
        dataset_name: Name of the dataset on HuggingFace Hub.
        num_epochs: Number of fine-tuning epochs.
        batch_size: Per-device batch size.
        learning_rate: Peak learning rate (with linear warmup).
        output_dir: Directory for checkpoints and logs.

    Returns:
        Trained Trainer instance.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )

    dataset = load_dataset(dataset_name)

    def tokenize(examples):
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=256,
        )

    tokenized = dataset.map(tokenize, batched=True)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        accuracy = (preds == labels).mean()
        return {"accuracy": accuracy}

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 2,
        learning_rate=learning_rate,
        weight_decay=0.01,
        warmup_ratio=0.1,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        fp16=True,
        logging_steps=100,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        compute_metrics=compute_metrics,
    )

    trainer.train()
    return trainer

Choosing a Pretrained Model

The abundance of pretrained models creates a new problem: how to choose. Here is a practical framework:

| Factor | What to Check | Why It Matters |
|---|---|---|
| Task alignment | Was the model pretrained on a similar task? | Closer pretraining tasks transfer better |
| Domain alignment | Was the model trained on data from your domain? | Domain-specific models (BioBERT, FinBERT) outperform general ones |
| Model size | Can you serve it within your latency budget? | Larger models are better but slower |
| License | Is the license compatible with your use case? | Apache 2.0 vs. research-only vs. commercial |
| Community adoption | Downloads, citations, leaderboard performance | Widely used models are better debugged |
| Recency | When was it trained? | Newer models generally outperform older ones |

Implementation Note: HuggingFace models cache downloads in ~/.cache/huggingface/. On a shared server, set HF_HOME to a shared directory to avoid redundant downloads. For air-gapped environments, use huggingface-cli download to pre-download models, or save with model.save_pretrained(local_path) once and load with AutoModel.from_pretrained(local_path) from local disk.


13.8 The Complete Modern DL Workflow

We now have all the pieces to describe the complete workflow that a senior deep learning practitioner follows for a new task.

Step 1: Problem Formulation

Define the task, the input/output format, and the evaluation metric. For StreamRec retrieval:

  • Task: Given a user's profile and history, retrieve the top-K items most likely to be engaged with.
  • Input: User features (profile, watch history). Item features (metadata, description).
  • Output: Ranked list of items.
  • Metric: Hit Rate@100, MRR.

Step 2: Baseline with Existing Pretrained Model

Start with a zero-shot or linear probe baseline. This takes hours, not weeks, and establishes whether your problem is already solved.

# Step 2a: Embed all items using a pretrained sentence transformer
# item_embeddings = build_item_embeddings(item_descriptions)

# Step 2b: Embed user queries as text ("user who watches jazz, cooking, tech")
# query_embeddings = build_item_embeddings(user_summaries)

# Step 2c: Nearest neighbor retrieval
# similarities = query_embeddings @ item_embeddings.T
# top_k = similarities.argsort(axis=1)[:, -100:][:, ::-1]

# Step 2d: Evaluate
# baseline_hr100 = hit_rate_at_k(top_k, ground_truth, k=100)
# print(f"Zero-shot HR@100: {baseline_hr100:.3f}")
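The hit_rate_at_k helper assumed in the baseline sketch is not defined elsewhere in the chapter; a minimal numpy version might look like this:

```python
import numpy as np


def hit_rate_at_k(
    top_k: np.ndarray,         # (n_users, K) retrieved item indices, best first
    ground_truth: np.ndarray,  # (n_users,) the true relevant item per user
    k: int = 100,
) -> float:
    """Fraction of users whose relevant item appears in their top-k list."""
    hits = (top_k[:, :k] == ground_truth[:, None]).any(axis=1)
    return float(hits.mean())
```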

Step 3: Fine-Tune or Adapt

If the baseline is promising but insufficient, fine-tune the pretrained model on your labeled data. For two-tower retrieval, this means training with contrastive loss on user-item engagement data.

Step 4: Evaluate Rigorously

Evaluate on a held-out test set that reflects production conditions:

  • Temporal split: Train on data before time $t$, test on data after time $t$. Never use a random split for recommendation data; it leaks future information into training.
  • Cold-start evaluation: Separately evaluate on new users and new items that were not in the training set.
  • Fairness audit: Check retrieval quality across user demographics and content categories.
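The temporal split can be sketched in a few lines (the function name and quantile-based cutoff are our illustrative choices):

```python
import numpy as np


def temporal_split(
    timestamps: np.ndarray, test_fraction: float = 0.2
) -> tuple[np.ndarray, np.ndarray]:
    """Split interaction indices by time: train strictly precedes test.

    Unlike a random split, no test interaction precedes any training
    interaction, so the model never sees "future" engagement signals
    during training.
    """
    cutoff = np.quantile(timestamps, 1.0 - test_fraction)
    train_idx = np.where(timestamps < cutoff)[0]
    test_idx = np.where(timestamps >= cutoff)[0]
    return train_idx, test_idx
```

For cold-start evaluation, further partition test_idx by whether the user or item appears anywhere in the training interactions.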

Step 5: Deploy

The two-tower architecture enables a clean deployment pattern:

  1. Offline: Run the item tower on all items, store embeddings in a vector index (FAISS).
  2. Online: Run the user tower on the current user's features, retrieve top-K from the vector index.
  3. Refresh: Re-index items periodically (daily or weekly) or when new items are added.

import faiss
import numpy as np


def build_faiss_index(
    item_embeddings: np.ndarray,
    index_type: str = "IVFFlat",
    nlist: int = 256,
) -> faiss.Index:
    """Build a FAISS index for fast approximate nearest neighbor search.

    Args:
        item_embeddings: Item embedding matrix (n_items, d).
        index_type: Type of FAISS index.
        nlist: Number of Voronoi cells for IVF index.

    Returns:
        Trained FAISS index.
    """
    d = item_embeddings.shape[1]
    item_embeddings = item_embeddings.astype(np.float32)

    if index_type == "FlatIP":
        # Exact inner product search (brute force)
        index = faiss.IndexFlatIP(d)
    elif index_type == "IVFFlat":
        # Approximate search with inverted file index
        quantizer = faiss.IndexFlatIP(d)
        index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
        index.train(item_embeddings)
        index.nprobe = 32  # Search 32 of 256 cells
    else:
        raise ValueError(f"Unknown index type: {index_type}")

    index.add(item_embeddings)
    return index


def retrieve_top_k(
    user_embedding: np.ndarray,
    index: faiss.Index,
    k: int = 100,
) -> tuple[np.ndarray, np.ndarray]:
    """Retrieve top-K items for a user using FAISS.

    Args:
        user_embedding: User query embedding (1, d) or (d,).
        index: Trained FAISS index.
        k: Number of items to retrieve.

    Returns:
        Tuple of (scores, item_indices), each of shape (1, k).
    """
    if user_embedding.ndim == 1:
        user_embedding = user_embedding.reshape(1, -1)
    user_embedding = user_embedding.astype(np.float32)

    scores, indices = index.search(user_embedding, k)
    return scores, indices

Production Reality: At scale, the two-tower + FAISS pipeline handles millions of items with sub-10ms latency. The item tower runs offline (batch inference), so its compute cost is amortized across all users. The user tower must run online (per-request), so it must be fast — typically a single transformer forward pass. This is why two-tower models use relatively small encoders (BERT-base, not BERT-large) for the query tower in production: the latency constraint dominates.


13.9 StreamRec Progressive Project — Milestone M5

This milestone integrates the concepts from this chapter into the StreamRec recommendation system. You will build the two-tower retrieval model that becomes the candidate generation stage of the full pipeline.

Task

Build a two-tower retrieval model for StreamRec:

  1. User tower: Encode user profile and watch history using a pretrained transformer.
  2. Item tower: Encode item metadata (title, description, tags) using a pretrained sentence transformer.
  3. Training: Contrastive loss with in-batch negatives.
  4. Evaluation: HR@10, HR@100, MRR on a temporal test split.
  5. Deployment: Index item embeddings with FAISS for real-time retrieval.

Implementation Skeleton

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer
import numpy as np


class StreamRecUserTower(nn.Module):
    """User tower for StreamRec two-tower retrieval.

    Encodes user watch history as a sequence of item embeddings,
    processes through a small transformer, and projects to the
    shared embedding space.

    Args:
        pretrained_model: Name of the pretrained transformer.
        embedding_dim: Output embedding dimensionality.
        max_history: Maximum number of history items to consider.
    """

    def __init__(
        self,
        pretrained_model: str = "distilbert-base-uncased",
        embedding_dim: int = 128,
        max_history: int = 50,
    ) -> None:
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained_model)
        hidden_size = self.encoder.config.hidden_size  # 768 for distilbert

        # Freeze encoder, train projection
        for param in self.encoder.parameters():
            param.requires_grad = False

        self.projection = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, embedding_dim),
        )

    def forward(
        self, input_ids: torch.Tensor, attention_mask: torch.Tensor
    ) -> torch.Tensor:
        """Encode user history into an embedding vector.

        Args:
            input_ids: Tokenized user history (batch, seq_len).
            attention_mask: Attention mask (batch, seq_len).

        Returns:
            L2-normalized user embedding (batch, embedding_dim).
        """
        # Encoder is frozen here; remove this no_grad context if you unfreeze it (Track B)
        with torch.no_grad():
            outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)

        # Mean pooling over non-padding tokens
        hidden = outputs.last_hidden_state  # (batch, seq_len, hidden)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

        projected = self.projection(pooled)
        return F.normalize(projected, dim=-1)


class StreamRecItemTower(nn.Module):
    """Item tower for StreamRec two-tower retrieval.

    Encodes item title and description using a pretrained sentence
    transformer, then projects to the shared embedding space.

    Args:
        sentence_model: Name of the sentence transformer model.
        embedding_dim: Output embedding dimensionality.
    """

    def __init__(
        self,
        sentence_model: str = "all-MiniLM-L6-v2",
        embedding_dim: int = 128,
    ) -> None:
        super().__init__()
        self.encoder = SentenceTransformer(sentence_model)
        encoder_dim = self.encoder.get_sentence_embedding_dimension()  # 384

        # Freeze encoder, train projection
        for param in self.encoder.parameters():
            param.requires_grad = False

        self.projection = nn.Sequential(
            nn.Linear(encoder_dim, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, embedding_dim),
        )

    def forward(self, item_embeddings: torch.Tensor) -> torch.Tensor:
        """Project pre-computed item embeddings to the shared space.

        In practice, item embeddings from the sentence transformer
        are pre-computed offline. This forward pass only applies
        the learned projection.

        Args:
            item_embeddings: Pre-encoded item features (batch, encoder_dim).

        Returns:
            L2-normalized item embedding (batch, embedding_dim).
        """
        projected = self.projection(item_embeddings)
        return F.normalize(projected, dim=-1)


class StreamRecRetrieval(nn.Module):
    """Two-tower retrieval model for StreamRec (Milestone M5).

    Combines user and item towers with contrastive (InfoNCE) loss.
    Designed for the candidate retrieval stage of the recommendation
    pipeline, preceding the transformer ranking model from M4.

    Args:
        embedding_dim: Shared embedding space dimensionality.
        temperature: Contrastive loss temperature.
    """

    def __init__(
        self, embedding_dim: int = 128, temperature: float = 0.05
    ) -> None:
        super().__init__()
        self.user_tower = StreamRecUserTower(embedding_dim=embedding_dim)
        self.item_tower = StreamRecItemTower(embedding_dim=embedding_dim)
        self.temperature = temperature

    def contrastive_loss(
        self,
        user_emb: torch.Tensor,
        item_emb: torch.Tensor,
    ) -> torch.Tensor:
        """Compute symmetric InfoNCE loss with in-batch negatives.

        Args:
            user_emb: Normalized user embeddings (B, d).
            item_emb: Normalized item embeddings (B, d).

        Returns:
            Scalar contrastive loss.
        """
        logits = torch.mm(user_emb, item_emb.T) / self.temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        loss_u2i = F.cross_entropy(logits, labels)
        loss_i2u = F.cross_entropy(logits.T, labels)
        return (loss_u2i + loss_i2u) / 2.0

Track Guidance

  • Track A (Minimal): Freeze both encoders, train only projection heads. Evaluate HR@100 on temporal test split. Target: HR@100 > 0.15.
  • Track B (Standard): Unfreeze the user tower encoder with differential learning rates. Add hard negative mining (sample items the user viewed but did not complete). Target: HR@100 > 0.25.
  • Track C (Full): Fine-tune both towers with progressive unfreezing. Implement mixed negatives (50% in-batch, 50% hard). Add cross-modal retrieval: use CLIP to embed item thumbnails alongside text, concatenate embeddings. Target: HR@100 > 0.30.
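Track B's differential learning rates are set with optimizer parameter groups. A sketch assuming the encoder/projection attribute names from the StreamRecUserTower skeleton above (the specific rates are illustrative, not prescriptive):

```python
import torch
import torch.nn as nn


def build_differential_optimizer(user_tower: nn.Module) -> torch.optim.AdamW:
    """Lower LR for the pretrained encoder, higher for the fresh projection head."""
    return torch.optim.AdamW(
        [
            {"params": user_tower.encoder.parameters(), "lr": 1e-5},
            {"params": user_tower.projection.parameters(), "lr": 1e-3},
        ],
        weight_decay=0.01,
    )
```

Remember to set requires_grad=True on the encoder parameters before building the optimizer, since the skeleton freezes them at construction time.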

Connection to Previous and Future Milestones

| Milestone | Chapter | Component | Role |
|---|---|---|---|
| M1 | Ch. 5 | LSH/FAISS ANN | The vector index infrastructure reused here |
| M3 | Ch. 8 | 1D CNN content embeddings | Replaced by pretrained sentence transformer embeddings |
| M4 | Ch. 10 | Transformer session model | Becomes the ranking model after two-tower retrieval |
| M5 | Ch. 13 | Two-tower retrieval | Candidate generation with pretrained encoders |
| M6 | Ch. 14 | GNN collaborative filtering | Alternative retrieval using graph structure |

13.10 Climate DL: Fine-Tuning a Vision Transformer for Satellite Imagery

The Climate Deep Learning anchor provides a concrete example of fine-tuning in a domain (satellite imagery) that is moderately distant from typical pretraining data (ImageNet).

The Problem

The Pacific Climate Research Consortium (PCRC) needs to classify satellite imagery into land-use categories (forest, cropland, urban, water, barren, wetland) to track deforestation, urbanization, and land degradation. They have 8,000 labeled satellite images — too few to train a vision transformer from scratch, but enough for effective fine-tuning.

The Approach

Fine-tune a ViT (Vision Transformer) pretrained on ImageNet-21k:

from transformers import ViTForImageClassification


def build_satellite_classifier(
    num_classes: int = 6,
    pretrained_model: str = "google/vit-base-patch16-224",
    freeze_backbone: bool = False,
) -> ViTForImageClassification:
    """Build a satellite image classifier by fine-tuning a pretrained ViT.

    Loads a ViT pretrained on ImageNet-21k and replaces the classification
    head for the satellite land-use classification task.

    Args:
        num_classes: Number of land-use categories.
        pretrained_model: HuggingFace model identifier.
        freeze_backbone: If True, freeze all backbone layers (linear probe).

    Returns:
        ViT model ready for fine-tuning.
    """
    model = ViTForImageClassification.from_pretrained(
        pretrained_model,
        num_labels=num_classes,
        ignore_mismatched_sizes=True,  # Replace head with correct size
    )

    if freeze_backbone:
        for name, param in model.named_parameters():
            if "classifier" not in name:
                param.requires_grad = False

    # Count parameters
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total: {total:,} | Trainable: {trainable:,} ({100*trainable/total:.1f}%)")

    return model


# Linear probe: only train the classification head
# model_probe = build_satellite_classifier(freeze_backbone=True)
# Total: 85,802,502 | Trainable: 4,614 (0.0%)

# Full fine-tuning: train everything
# model_ft = build_satellite_classifier(freeze_backbone=False)
# Total: 85,802,502 | Trainable: 85,802,502 (100.0%)

Domain Distance Analysis

Satellite imagery differs from ImageNet in several ways:

  • Viewing angle: Top-down (nadir) vs. ground-level perspective.
  • Color distribution: Vegetation indices (NDVI), false-color composites vs. natural color photographs.
  • Texture patterns: Agricultural grids, forest canopy textures vs. animal fur, fabric patterns.
  • Scale: A "building" in ImageNet fills the frame; in satellite imagery, a building is a few pixels.

Despite these differences, the early-layer features of ImageNet-pretrained models (edges, textures, color gradients) transfer reasonably well to satellite imagery. The later-layer features (object parts, scene layouts) transfer less well, which is why fine-tuning the full model outperforms a linear probe for this domain.
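
Between the linear probe and full fine-tuning sits progressive unfreezing: train the head first, then gradually make the last few transformer blocks trainable. The sketch below uses stand-in blocks so it runs without downloading weights; for the HuggingFace ViT above, the equivalent block list should be `model.vit.encoder.layer` (an assumption about the model's internals worth verifying for your transformers version):

```python
import torch.nn as nn


def freeze_all(module: nn.Module) -> None:
    """Freeze every parameter in the module."""
    for p in module.parameters():
        p.requires_grad = False


def unfreeze_last_blocks(blocks: nn.ModuleList, k: int) -> None:
    """Progressive unfreezing: make only the last k blocks trainable."""
    for block in list(blocks)[-k:]:
        for p in block.parameters():
            p.requires_grad = True


# Stand-in for 12 ViT transformer blocks
blocks = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
freeze_all(blocks)
unfreeze_last_blocks(blocks, k=2)
trainable = [i for i, b in enumerate(blocks)
             if any(p.requires_grad for p in b.parameters())]
print(trainable)  # [10, 11]
```

In practice you increase `k` every few epochs, which lets the task-specific later layers adapt first while the well-transferring early layers stay fixed.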

Research Insight: Neumann et al. (2019) and Mañas et al. (2021) showed that self-supervised pretraining on unlabeled satellite imagery (using SimCLR or DINO) substantially outperforms ImageNet pretraining for remote sensing tasks. The key insight is that the pretraining domain matters as much as the pretraining method: features learned from satellite data are better priors for satellite tasks, even when the self-supervised pretraining uses no labels. If PCRC had access to a large corpus of unlabeled satellite images, self-supervised pretraining on that corpus followed by supervised fine-tuning on the 8,000 labeled images would be the strongest approach.
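
The objective such satellite pretraining would optimize is the standard SimCLR NT-Xent loss: each of the 2N embeddings (two augmented views per image) must identify its augmentation partner among the other 2N − 1. A minimal sketch of the standard formulation (not PCRC code):

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor,
                 temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent (SimCLR) loss for two augmented views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
    sim = torch.mm(z, z.T) / temperature                  # (2N, 2N)
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    n = z1.size(0)
    # Row i's positive is its augmentation partner at i + n (mod 2n)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets.to(z.device))


torch.manual_seed(0)
view1 = torch.randn(16, 64)
view2 = view1 + 0.05 * torch.randn(16, 64)  # stand-in for a mild augmentation
print(f"{nt_xent_loss(view1, view2).item():.3f}")
```

No labels appear anywhere: the supervisory signal comes entirely from knowing which two views came from the same image.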


13.11 Putting It All Together

This chapter has covered the full landscape of modern deep learning practice — the workflow that professionals actually follow, rather than the train-from-scratch narrative that dominates textbooks. The core ideas:

  1. Pretrained models encode massive datasets as reusable features. Using them is not a shortcut — it is the engineering-correct approach that respects the economics of compute and data.

  2. The transfer learning spectrum (zero-shot → linear probe → fine-tuning → progressive unfreezing → train from scratch) maps to a decision framework based on labeled data, domain distance, and compute budget.

  3. Self-supervised learning (masked modeling, contrastive learning) is how modern pretrained models are built. Understanding the pretraining objective tells you what the model learned and what it did not.

  4. Foundation models (BERT, ViT, CLIP) have shifted the economics of deep learning: the cost of solving a new task has dropped from "train a model for weeks" to "fine-tune for hours."

  5. Two-tower models with contrastive learning are the standard architecture for large-scale retrieval, using pretrained encoders for both query and document towers.

  6. Parameter-efficient adaptation (LoRA, adapters, prompt tuning) enables fine-tuning models that are too large to fully train, making the benefits of scale accessible to practitioners without massive compute budgets.

  7. The HuggingFace ecosystem is the practical infrastructure that makes all of this work: model hub, tokenizers, Trainer API, and community-contributed models.
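
Point 6 can be made concrete with a minimal LoRA layer: the pretrained weight is frozen, and a trainable low-rank update is added in parallel. This is a sketch of the idea, not the `peft` library's implementation:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scaling * (B A) x; B starts at zero, so training begins
        # from exactly the pretrained behavior.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable vs. 590,592 in the full layer
```

At rank 8, the trainable parameter count drops by roughly 50×, which is what makes fine-tuning very large models feasible on modest hardware.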

The next chapter (Graph Neural Networks) will extend deep learning to graph-structured data — the natural representation for user-item interactions, social networks, and molecular structures — and will provide a graph-based alternative to the two-tower retrieval model built here.

Simplest Model That Works: The theme of this chapter, stated plainly: before building anything complex, check whether someone has already trained a model that solves your problem. Before training from scratch, try fine-tuning. Before fine-tuning the full model, try a linear probe. The best engineers are not the ones who build the most — they are the ones who build only what needs to be built.