Chapter 13: Exercises

Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario.
- Two stars (**): Extend the technique or combine it with a previous chapter's methods.
- Three stars (***): Derive a result, implement from scratch, or design a system component.
- Four stars (****): Research-level problems that connect to open questions in the field.


Transfer Learning Fundamentals

Exercise 13.1 (*)

A computer vision team has 500 labeled images of manufacturing defects (cracks, scratches, discoloration) on metal surfaces. They want to build a defect classifier.

(a) Using the decision framework from Section 13.2, recommend a transfer learning strategy. Justify your choice by considering labeled data volume, domain distance from ImageNet, and compute budget.

(b) Would you expect a linear probe on a pretrained ResNet-50 to outperform training a small CNN (3 convolutional layers, ~500K parameters) from scratch on 500 images? Explain why or why not.

(c) The team acquires 50,000 additional labeled images. Does your recommendation change? Why?


Exercise 13.2 (*)

Load a pretrained ResNet-50 from torchvision and extract features from the CIFAR-10 test set:

import torch
from torchvision import models, transforms, datasets

# Load pretrained ResNet-50
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()
resnet.eval()

# CIFAR-10 test set with ImageNet preprocessing
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

(a) Extract 2048-dimensional features for all 10,000 test images. Train a logistic regression classifier on these features using 80% for training, 20% for validation. Report accuracy.

(b) Compare with a logistic regression trained on raw pixel values (32x32x3 = 3072 features). How large is the gap?

(c) Repeat part (a) using features from layer3 instead of the final layer (use a forward hook to capture intermediate features). Is the accuracy higher or lower than using the final layer? Relate your finding to the transferability gradient discussed in Section 13.1.


Exercise 13.3 (*)

Consider the following pretrained model selection scenario for a healthcare startup building a clinical note classifier (8 classes: diagnosis, medication, procedure, lab result, allergy, social history, family history, assessment).

| Model | Parameters | Pretraining Data | License |
|---|---|---|---|
| BERT-base | 110M | Wikipedia + BookCorpus | Apache 2.0 |
| BioBERT | 110M | PubMed + PMC | Apache 2.0 |
| ClinicalBERT | 110M | MIMIC-III clinical notes | PhysioNet license |
| GPT-3.5 | 175B | Broad internet text | Proprietary API |

(a) Rank these models for the clinical note task, considering task alignment, domain alignment, model size, license, and community adoption. Justify your ranking.

(b) The startup has 2,000 labeled clinical notes and a budget of $500/month for compute. Which model would you recommend and which transfer strategy (zero-shot, linear probe, fine-tuning)?

(c) What privacy concerns arise from using ClinicalBERT (pretrained on MIMIC-III patient data)?


Exercise 13.4 (*)

Use the proxy_a_distance function from Section 13.3 to measure the domain distance between:

import numpy as np
from sklearn.datasets import fetch_openml

# Source: MNIST digits
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
source = mnist.data[:5000].astype(np.float32) / 255.0

# Target 1: Fashion-MNIST
fashion = fetch_openml("Fashion-MNIST", version=1, as_frame=False)
target_fashion = fashion.data[:5000].astype(np.float32) / 255.0

# Target 2: perturbed MNIST (Gaussian noise as a stand-in for a small shift)
rng = np.random.RandomState(42)
target_perturbed = source + rng.normal(0, 0.1, source.shape)
target_perturbed = np.clip(target_perturbed, 0, 1).astype(np.float32)

(a) Compute the proxy A-distance between MNIST and Fashion-MNIST, and between MNIST and the perturbed MNIST. Which domain is closer?

(b) Based on the distances, predict which transfer scenario would benefit more from fine-tuning vs. linear probing.

(c) Now extract features using a pretrained ResNet-50 (resize images to 224x224, 1-channel → 3-channel by repeating). Recompute proxy A-distances on the ResNet features. Do the distances change? What does this tell you about feature-space domain distance vs. pixel-space domain distance?
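The exercise assumes the proxy_a_distance function from Section 13.3. If you need a stand-in, a minimal version consistent with the definition $\hat{d}_\mathcal{A} = 2(1 - 2\epsilon)$, where $\epsilon$ is the test error of a source-vs-target classifier, might look like this (the choice of classifier and split is an assumption, not the book's exact code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance_sketch(source: np.ndarray, target: np.ndarray,
                            seed: int = 0) -> float:
    """Proxy A-distance: 2 * (1 - 2 * err) of a domain classifier."""
    X = np.vstack([source, target])
    y = np.concatenate([np.zeros(len(source)), np.ones(len(target))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.RandomState(0)
a = rng.randn(500, 10)
b = rng.randn(500, 10)          # same distribution -> distance near 0
c = rng.randn(500, 10) + 3.0    # shifted distribution -> distance near 2
print(round(proxy_a_distance_sketch(a, b), 2))
print(round(proxy_a_distance_sketch(a, c), 2))
```

Identical distributions drive the domain classifier toward 50% error (distance near 0), while easily separable domains drive it toward 0% error (distance near 2).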


Fine-Tuning and Progressive Unfreezing

Exercise 13.5 (**)

Implement a learning rate finder for fine-tuning. Starting from a very small learning rate ($10^{-7}$), exponentially increase it over one epoch and record the loss at each step:

import torch
from torch.optim import SGD
import math


def lr_range_test(
    model: torch.nn.Module,
    train_loader: torch.utils.data.DataLoader,
    criterion: torch.nn.Module,
    start_lr: float = 1e-7,
    end_lr: float = 1e-1,
) -> tuple[list[float], list[float]]:
    """Learning rate range test for finding optimal fine-tuning LR.

    Increases learning rate exponentially from start_lr to end_lr
    over one epoch, recording loss at each step.

    Args:
        model: The model to test.
        train_loader: Training data loader.
        criterion: Loss function.
        start_lr: Starting (minimum) learning rate.
        end_lr: Ending (maximum) learning rate.

    Returns:
        Tuple of (learning_rates, losses).
    """
    optimizer = SGD(model.parameters(), lr=start_lr)
    n_steps = len(train_loader)
    gamma = (end_lr / start_lr) ** (1 / n_steps)

    lrs, losses = [], []
    for i, (inputs, targets) in enumerate(train_loader):
        # Set this step's learning rate before the update (exponential ramp);
        # setting it after the step would apply each rate one step late.
        current_lr = start_lr * (gamma ** i)
        for pg in optimizer.param_groups:
            pg["lr"] = current_lr

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        lrs.append(current_lr)
        losses.append(loss.item())

        if loss.item() > 4 * min(losses):
            break  # Loss diverged

    return lrs, losses

(a) Run this on a pretrained ResNet-50 fine-tuned on CIFAR-10. Plot learning rate (log scale) vs. loss. Identify the optimal learning rate (the point of steepest descent).

(b) The optimal fine-tuning learning rate is typically 10-100x smaller than the optimal learning rate for training from scratch. Using your experiment, verify or refute this claim. Explain why the difference exists.

(c) Modify the function to support differential learning rates: run separate range tests for the backbone and head parameter groups. Do they have different optimal learning rates?


Exercise 13.6 (**)

Implement progressive unfreezing for a pretrained BERT model fine-tuned on a text classification task.

(a) Split BERT into eight groups: the embedding layer, layers 0-1, layers 2-3, layers 4-5, layers 6-7, layers 8-9, layers 10-11 + pooler, and the classification head. Implement the progressive unfreezing schedule: the head is trainable from the start, and one additional group is unfrozen (from the top down) every 2 epochs, training for 14 epochs total.

(b) Compare three strategies on a text classification dataset (e.g., AG News from HuggingFace datasets):

- Full fine-tuning from epoch 1 (unfreeze everything, single learning rate)
- Full fine-tuning with differential learning rates (10x per group)
- Progressive unfreezing with differential learning rates

Report accuracy and training loss curves for each. Which converges fastest? Which achieves the best final accuracy?

(c) Does progressive unfreezing prevent catastrophic forgetting? To test this, evaluate the fine-tuned model on the original pretraining task (masked language modeling) after fine-tuning with and without progressive unfreezing. Which preserves more of the original capability?
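A name-prefix freezing helper is enough to drive the schedule in part (a). A sketch on a toy module (with HuggingFace BERT, the prefixes would be names like "bert.embeddings", "bert.encoder.layer.0", ..., "classifier" — those identifiers are the standard HuggingFace naming, assumed here):

```python
import torch.nn as nn

def set_trainable(model: nn.Module, prefixes: list[str]) -> int:
    """Freeze all parameters, then unfreeze those whose name starts with
    any of the given prefixes. Returns the number of trainable tensors."""
    n = 0
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(pref) for pref in prefixes)
        n += int(p.requires_grad)
    return n

# Toy stand-in with BERT-like top-level names, for illustration only.
model = nn.Sequential()
model.add_module("embeddings", nn.Linear(4, 4))
model.add_module("encoder", nn.Linear(4, 4))
model.add_module("classifier", nn.Linear(4, 2))

# Epochs 1-2: head only; each later stage adds one group from the top down.
print(set_trainable(model, ["classifier"]))             # → 2
print(set_trainable(model, ["classifier", "encoder"]))  # → 4
```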


Exercise 13.7 (**)

Catastrophic forgetting occurs when fine-tuning overwrites the pretrained representations.

(a) Fine-tune a pretrained BERT model on a small dataset (1,000 examples from IMDB) for 20 epochs with a large learning rate ($5 \times 10^{-4}$). Plot the training accuracy, validation accuracy, and the L2 distance $\|\theta_t - \theta_0\|$ between the current and initial parameters at each epoch. When does overfitting begin? Is it correlated with parameter drift?

(b) Implement elastic weight consolidation (EWC) (Kirkpatrick et al., 2017), which adds a penalty for drifting from the pretrained weights:

$$\mathcal{L}_{\text{EWC}} = \mathcal{L}_{\text{task}} + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^*)^2$$

where $F_i$ is the Fisher information for parameter $i$ and $\theta_i^*$ are the pretrained weights. Use a diagonal approximation to $F_i$ (the squared gradient averaged over a sample of pretraining data). Compare with standard fine-tuning.

(c) Compare EWC with a simpler approach: L2 regularization toward pretrained weights $\frac{\lambda}{2} \|\theta - \theta^*\|^2$. Under what conditions does the Fisher information weighting matter?
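A minimal sketch of the diagonal Fisher estimate and the EWC penalty from part (b). It is a stand-in, not a reference implementation: the number of batches and the use of cross-entropy as the log-likelihood are assumptions you should adapt to your setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def diagonal_fisher(model: nn.Module, batches, n_batches: int = 10) -> dict:
    """Diagonal Fisher approximation: mean squared gradient per parameter.

    `batches` is any iterable of (inputs, targets), e.g. a loader over a
    sample of pretraining data.
    """
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    count = 0
    for x, y in batches:
        if count >= n_batches:
            break
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        count += 1
    return {n: f / max(count, 1) for n, f in fisher.items()}

def ewc_penalty(model: nn.Module, fisher: dict, theta_star: dict,
                lam: float = 1.0) -> torch.Tensor:
    """lambda/2 * sum_i F_i (theta_i - theta*_i)^2 from the exercise."""
    total = 0.0
    for n, p in model.named_parameters():
        total = total + (fisher[n] * (p - theta_star[n]) ** 2).sum()
    return 0.5 * lam * total

# Sanity check: at theta = theta*, the penalty is exactly zero.
torch.manual_seed(0)
model = nn.Linear(5, 3)
theta_star = {n: p.detach().clone() for n, p in model.named_parameters()}
batches = [(torch.randn(8, 5), torch.randint(0, 3, (8,)))]
fisher = diagonal_fisher(model, batches, n_batches=1)
print(float(ewc_penalty(model, fisher, theta_star)))  # → 0.0
```

For part (c), replacing `fisher[n]` with a tensor of ones recovers the plain L2-toward-pretrained baseline.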


Contrastive Learning and Two-Tower Models

Exercise 13.8 (**)

Implement SimCLR from scratch for CIFAR-10:

import torch
import torch.nn as nn
import torchvision.transforms as T


class SimCLRAugmentation:
    """Data augmentation pipeline for SimCLR.

    Creates two random augmented views of each image.
    """

    def __init__(self, size: int = 32) -> None:
        self.transform = T.Compose([
            T.RandomResizedCrop(size, scale=(0.2, 1.0)),
            T.RandomHorizontalFlip(),
            T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
            T.RandomGrayscale(p=0.2),
            T.ToTensor(),
            T.Normalize([0.4914, 0.4822, 0.4465], [0.247, 0.243, 0.261]),
        ])

    def __call__(self, x):
        return self.transform(x), self.transform(x)

(a) Train a ResNet-18 backbone with SimCLR on CIFAR-10 (no labels) for 100 epochs. Use NT-Xent loss with temperature $\tau = 0.5$ and batch size 512. After training, freeze the backbone and train a linear probe on the labeled data. Report accuracy.

(b) Compare the SimCLR linear probe accuracy with: (1) a linear probe on an ImageNet-pretrained ResNet-18, (2) a ResNet-18 trained from scratch with supervision on CIFAR-10. Discuss which approach wins and why.

(c) Vary the temperature $\tau \in \{0.05, 0.1, 0.5, 1.0, 2.0\}$. How does temperature affect learned representations (measured by linear probe accuracy)? Explain the role of temperature mathematically: what happens to the gradient of the NT-Xent loss as $\tau \to 0$ and $\tau \to \infty$?
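Part (a) needs the NT-Xent loss. A compact sketch of the SimCLR formulation, with the two augmented views concatenated into a single 2B-row batch (the projection head producing z1, z2 is assumed):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent loss. z1, z2: (B, d) projections of the two views."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, d), unit norm
    sim = z @ z.t() / tau                               # temperature-scaled cosine
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    b = z1.shape[0]
    # Row i's positive is the other view of the same image: i+B or i-B.
    targets = torch.cat([torch.arange(b) + b, torch.arange(b)])
    return F.cross_entropy(sim, targets)

torch.manual_seed(0)
loss = nt_xent(torch.randn(8, 16), torch.randn(8, 16), tau=0.5)
```

Expressing the loss as a cross-entropy over the similarity matrix makes the temperature sweep in part (c) a one-argument change.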


Exercise 13.9 (**)

The two-tower model uses in-batch negatives, which introduces popularity bias: popular items appear in more batches and receive more negative gradient signal, pushing them away from all users.

(a) Derive the expected number of times an item with engagement probability $p_i$ appears as a negative in one epoch, given batch size $B$ and $N$ total interactions. Show that popular items receive disproportionately more negative signal.

(b) Implement logQ correction (Yi et al., 2019), which compensates for the sampling bias by subtracting $\log p_i$ from the logit for item $i$:

$$s_{\text{corrected}}(q, k_j) = \frac{\mathbf{q}^\top \mathbf{k}_j}{\tau} - \log p_j$$

where $p_j$ is the empirical frequency of item $j$ in the training data. Implement this correction in the TwoTowerModel.forward method.

(c) Train the two-tower model with and without logQ correction on synthetic data where 10% of items account for 80% of interactions (Zipfian distribution). Compare the embedding space: does the correction lead to a more uniform distribution of items in the embedding space?
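A self-contained sketch of the corrected in-batch softmax loss for part (b) — not the chapter's TwoTowerModel.forward itself, but the same computation on raw embedding matrices. Note that for a uniform item frequency the correction subtracts a constant from every logit and leaves the loss unchanged; only non-uniform (e.g. Zipfian) frequencies change the gradients.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(q: torch.Tensor, k: torch.Tensor,
                          item_freq: torch.Tensor, tau: float = 0.1,
                          logq_correction: bool = True) -> torch.Tensor:
    """In-batch sampled softmax with optional logQ correction.

    q, k: (B, d) user/item embeddings; item_freq: (B,) empirical sampling
    probability of each in-batch item. The diagonal holds the positives.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau
    if logq_correction:
        # Subtract log p_j from the logit of item j (Yi et al., 2019).
        logits = logits - torch.log(item_freq).unsqueeze(0)
    targets = torch.arange(q.shape[0])
    return F.cross_entropy(logits, targets)

torch.manual_seed(0)
q, k = torch.randn(4, 8), torch.randn(4, 8)
loss = in_batch_softmax_loss(q, k, torch.full((4,), 0.25))
```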


Exercise 13.10 (***)

Derive the connection between contrastive loss and mutual information.

(a) Starting from the InfoNCE loss:

$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{e^{f(\mathbf{x}, \mathbf{y}^+)}}{e^{f(\mathbf{x}, \mathbf{y}^+)} + \sum_{j=1}^{K} e^{f(\mathbf{x}, \mathbf{y}_j^-)}}\right]$$

show that the optimal critic $f^*(\mathbf{x}, \mathbf{y}) = \log \frac{p(\mathbf{y} \mid \mathbf{x})}{p(\mathbf{y})} + c$ for any constant $c$.

(b) Substitute the optimal critic into the InfoNCE loss and show that:

$$I(\mathbf{x}; \mathbf{y}) \geq \log(K + 1) - \mathcal{L}_{\text{InfoNCE}}$$

Explain why this bound tightens with more negatives $K$.

(c) This bound saturates at $\log(K+1)$. What does this mean practically for estimating mutual information with contrastive learning? If the true mutual information is 10 nats, how many negatives do you need for the bound to be non-trivial?


Exercise 13.11 (***)

Implement a two-tower model with hard negative mining for StreamRec.

(a) Define three types of negatives:

- Easy negatives: Random items from the catalog.
- Semi-hard negatives: Items from the same category as the positive but not engaged with.
- Hard negatives: Items the user viewed but did not complete (started watching but abandoned).

Implement a HardNegativeSampler that returns a mix of all three types.

(b) Train the two-tower model with: (1) only easy negatives, (2) only hard negatives, (3) a 50/25/25 mix. Compare HR@10 and MRR on a held-out test set. Which strategy works best?

(c) Hard negative mining can lead to collapsed representations where the model learns to distinguish only the hardest negatives. Implement the semi-hard mining strategy from FaceNet (Schroff et al., 2015): select negatives that are harder than the positive but not too hard (i.e., negatives where $\mathbf{q}^\top \mathbf{k}^- < \mathbf{q}^\top \mathbf{k}^+$ but $\mathbf{q}^\top \mathbf{k}^-$ is within some margin of $\mathbf{q}^\top \mathbf{k}^+$). Does this prevent collapse?
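The selection rule in part (c) reduces to a boolean mask over candidate negative scores. A sketch for a single query (batching and integration with the sampler are left to the exercise; the shapes and margin value are illustrative assumptions):

```python
import torch

def semi_hard_mask(q: torch.Tensor, k_pos: torch.Tensor, k_neg: torch.Tensor,
                   margin: float = 0.2) -> torch.Tensor:
    """Boolean mask selecting semi-hard negatives in the FaceNet sense.

    Keeps negatives scoring below the positive (q.k^- < q.k^+) but within
    `margin` of it. Shapes: q, k_pos (d,); k_neg (N, d).
    """
    s_pos = q @ k_pos      # scalar positive score
    s_neg = k_neg @ q      # (N,) candidate negative scores
    return (s_neg < s_pos) & (s_neg > s_pos - margin)

q = torch.tensor([1.0, 0.0])
k_pos = torch.tensor([1.0, 0.0])                        # s_pos = 1.0
k_neg = torch.tensor([[0.95, 0.0], [0.5, 0.0], [1.2, 0.0]])
print(semi_hard_mask(q, k_pos, k_neg).tolist())         # → [True, False, False]
```

The third candidate scores above the positive (too hard) and the second falls outside the margin (too easy), so only the first survives.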


Parameter-Efficient Fine-Tuning

Exercise 13.12 (**)

Implement LoRA for a pretrained BERT model:

(a) Apply LoRA (rank $r=8$, $\alpha=16$) to the query and value projection matrices in all attention layers of bert-base-uncased. How many trainable parameters does this add? Express as a fraction of the total model parameters.

(b) Fine-tune the LoRA model on SST-2 (Stanford Sentiment Treebank, binary classification) and compare accuracy with full fine-tuning. Use the same number of training steps and learning rate for both.

(c) After training, merge the LoRA weights into the base model: $W' = W + \frac{\alpha}{r} BA$. Verify that the merged model produces identical outputs to the LoRA model. What is the storage cost of saving only the LoRA weights vs. saving the full merged model?
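A minimal sketch of the LoRA idea on a single linear layer, including the merge check from part (c). This is illustrative only; for BERT-scale experiments the HuggingFace peft library is the usual route, and the initialization scale here is an assumption.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a rank-r update: y = base(x) + (alpha/r) x A^T B^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16) -> None:
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only A and B are trainable
        self.scale = alpha / r
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

    def merge(self) -> nn.Linear:
        """Fold the low-rank update into a plain Linear (part c)."""
        merged = nn.Linear(self.base.in_features, self.base.out_features)
        with torch.no_grad():
            merged.weight.copy_(self.base.weight + self.scale * (self.B @ self.A))
            merged.bias.copy_(self.base.bias)
        return merged
```

Because B starts at zero, the wrapped layer initially computes exactly the base layer's output, so fine-tuning begins from the pretrained function.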


Exercise 13.13 (**)

Compare three parameter-efficient fine-tuning methods on the same task (AG News text classification with 5,000 training examples):

(a) Implement and evaluate:

- Full fine-tuning (all parameters)
- LoRA (rank 8) on attention matrices
- Adapter layers (bottleneck dimension 64) after each transformer block
- Prompt tuning (20 learnable soft tokens prepended to input)

Report accuracy, number of trainable parameters, training time, and GPU memory usage for each.

(b) For each method, plot accuracy vs. number of training examples (100, 500, 1,000, 5,000). Which method is most data-efficient? Which degrades most gracefully with limited data?

(c) LoRA and adapters can be combined. Apply LoRA to attention matrices AND add adapter layers. Does the combination outperform either alone? Is the improvement worth the additional complexity?


Foundation Models and Embeddings

Exercise 13.14 (**)

Build an embedding-based semantic search engine for the StreamRec catalog.

(a) Generate 1,000 synthetic item descriptions (e.g., "Documentary about marine biology in the Pacific Ocean", "Comedy special about growing up in the Midwest"). Encode all descriptions using all-MiniLM-L6-v2 from sentence-transformers.

(b) Implement semantic search: given a natural language query ("I want to learn about space exploration"), retrieve the top-5 most similar items using cosine similarity. Verify that the results are semantically relevant.

(c) Compare the retrieval quality of three sentence embedding models:

- all-MiniLM-L6-v2 (22M parameters, 384 dimensions)
- all-mpnet-base-v2 (109M parameters, 768 dimensions)
- BAAI/bge-large-en-v1.5 (335M parameters, 1024 dimensions)

Measure retrieval latency per query and embedding dimensionality. Is the quality improvement from larger models worth the latency cost for StreamRec's real-time serving requirements?
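The retrieval step in part (b) is just a cosine top-k over the encoded matrix. A sketch on stand-in vectors (in the exercise, the vectors come from `SentenceTransformer("all-MiniLM-L6-v2").encode(...)`; the 384-dimensional stand-ins mimic that model's output size):

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, item_vecs: np.ndarray,
                 k: int = 5) -> np.ndarray:
    """Indices of the k rows of item_vecs most cosine-similar to query_vec."""
    item_norm = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = item_norm @ q
    return np.argsort(-scores)[:k]

rng = np.random.RandomState(0)
items = rng.randn(100, 384)
query = items[17] + 0.01 * rng.randn(384)   # near-duplicate of item 17
print(top_k_cosine(query, items, k=5)[0])   # → 17
```

Pre-normalizing the item matrix once (rather than per query) is what makes this cheap enough to benchmark latency across the three models.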


Exercise 13.15 (**)

Explore CLIP's zero-shot capabilities:

(a) Download 100 images spanning 10 categories from a dataset of your choice (e.g., CIFAR-10 test set or web images). Classify them using CLIP with class descriptions as prompts. Report accuracy.

(b) Prompt engineering matters. For the same images, compare three prompt templates:

- "a photo of a {class}"
- "{class}"
- "a centered photo of a {class}, high quality"

Which template gives the best accuracy? Why does prompt format affect zero-shot performance?

(c) CLIP is trained on image-text pairs from the internet, which contain biases. Test CLIP's zero-shot performance across different demographic groups (e.g., classify the same activity performed by people of different apparent genders or ethnicities). Document any performance disparities.


Exercise 13.16 (***)

CLIP learns a joint embedding space for images and text. Investigate the geometry of this space:

(a) Encode 50 images and their textual descriptions using CLIP. Compute the cosine similarity matrix (50 x 50 for image-image, text-text, and image-text). Visualize all three matrices as heatmaps. Is the image-text alignment diagonal (matching pairs are most similar)?

(b) Perform linear interpolation between two image embeddings: $\mathbf{v}_\alpha = (1-\alpha)\mathbf{v}_1 + \alpha\mathbf{v}_2$ for $\alpha \in [0, 1]$. For each interpolated embedding, find the nearest text description. What concepts does the interpolation path traverse?

(c) Test the "modality gap" phenomenon (Liang et al., 2022): compute the mean image embedding and mean text embedding across a large set of paired data. Is there a systematic offset between the image and text embedding centroids? What are the implications for cross-modal retrieval?


Domain Adaptation

Exercise 13.17 (***)

Implement Domain-Adversarial Neural Networks (DANN) (Ganin et al., 2016) for unsupervised domain adaptation.

(a) The DANN architecture adds a domain classifier (with gradient reversal) to a standard classifier:

class GradientReversal(torch.autograd.Function):
    """Gradient reversal layer for domain-adversarial training."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

Implement the full DANN: shared feature extractor → (task classifier, domain classifier with gradient reversal). The feature extractor learns representations that are discriminative for the task but invariant to the domain.
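A sketch of the wiring (the GradientReversal Function is repeated so the block is self-contained; the layer sizes and module names are illustrative, not the benchmark architecture):

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity forward; gradients multiplied by -alpha on the way back."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

class DANN(nn.Module):
    def __init__(self, in_dim: int = 784, feat_dim: int = 32,
                 n_classes: int = 10) -> None:
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.task_head = nn.Linear(feat_dim, n_classes)
        self.domain_head = nn.Linear(feat_dim, 2)

    def forward(self, x: torch.Tensor, alpha: float = 1.0):
        f = self.features(x)
        # Task gradients flow normally; domain gradients are reversed, so the
        # features become domain-invariant while staying task-discriminative.
        return self.task_head(f), self.domain_head(GradientReversal.apply(f, alpha))
```

Training sums the task loss (on labeled source data) and the domain loss (on source + unlabeled target data), with `alpha` typically ramped up over training.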

(b) Test on the digits domain adaptation benchmark: train on MNIST (source), adapt to SVHN (target, no labels). Compare DANN with: (1) source-only training, (2) fine-tuning the source model on a small amount of SVHN labeled data. Report accuracy on the SVHN test set.

(c) DANN assumes a shared label space across domains. What happens when the source and target have different label distributions (label shift)? Propose and implement a modification that handles this case.


System Design and Production

Exercise 13.18 (**)

Design a model serving system for the StreamRec two-tower model that handles 10,000 queries per second with <50ms p99 latency.

(a) Draw a system architecture diagram showing: user tower inference, FAISS index, result caching, model update pipeline. Identify the bottleneck component.

(b) The item catalog is updated daily (100 new items, 500 metadata updates). Design a pipeline that recomputes affected item embeddings and updates the FAISS index without downtime.

(c) The user tower model needs to be updated weekly (retrained on new engagement data). Design a blue-green deployment strategy that validates the new model before switching traffic. What metrics would you monitor during the canary period?


Exercise 13.19 (**)

The two-tower model produces a fixed embedding for each item, but item relevance changes over time (seasonal content, trending topics).

(a) Propose a method to incorporate temporal signals into item embeddings without recomputing all embeddings daily. Consider: additive temporal features, time-aware projection layers, or exponential decay weighting.

(b) Implement a simple approach: concatenate a time embedding (day-of-week, hour-of-day, days-since-publish) to the item embedding before the projection head. Train and evaluate the time-aware model. Does it outperform the static model on a temporal test split?

(c) The StreamRec catalog has 200,000 items. What is the total size of a flat (uncompressed) FAISS index for embedding dimensions $d \in \{64, 128, 256, 512\}$ with float32 embeddings? Now suppose the catalog grows to 200 million items: at what embedding dimension does the index no longer fit in a single machine's RAM (assume 64 GB)?
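The arithmetic for part (c) is a one-liner: a flat float32 index stores n_items * d * 4 bytes, and the size scales linearly in the item count, so larger catalogs follow directly.

```python
# Flat (uncompressed) FAISS index size: n_items * d * 4 bytes for float32.
n_items = 200_000
sizes_mb = {d: n_items * d * 4 / 1e6 for d in (64, 128, 256, 512)}
print(sizes_mb)  # {64: 51.2, 128: 102.4, 256: 204.8, 512: 409.6}
```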


Advanced Topics

Exercise 13.20 (***)

Implement model soups (Wortsman et al., 2022): average the weights of multiple fine-tuned models to produce a single model that outperforms any individual model.

(a) Fine-tune a pretrained ViT on CIFAR-10 five times with different random seeds (different data order, different head initialization). Record each model's test accuracy.

(b) Compute the uniform soup: average all five models' weights. Evaluate the soup. Does it outperform the individual models?

(c) Compute the greedy soup: start with the best individual model, then greedily add each remaining model to the average if it improves validation accuracy. Compare with the uniform soup. Does greedy selection matter?
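The uniform soup in part (b) is a key-wise average of state dicts. A sketch on toy models (the exercise applies it to the five fine-tuned ViTs, which must share an identical architecture):

```python
import copy
import torch
import torch.nn as nn

def uniform_soup(models: list[nn.Module]) -> nn.Module:
    """Average the weights of identically-structured models."""
    avg = {k: torch.stack([m.state_dict()[k].float() for m in models]).mean(dim=0)
           for k in models[0].state_dict()}
    soup = copy.deepcopy(models[0])
    soup.load_state_dict(avg)
    return soup

ms = [nn.Linear(4, 2) for _ in range(3)]
soup = uniform_soup(ms)
```

The greedy soup in part (c) reuses the same averaging routine, adding candidates one at a time and keeping each only if held-out accuracy improves.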


Exercise 13.21 (***)

Derive the connection between contrastive learning and spectral clustering.

(a) Show that the optimal embeddings for the InfoNCE loss (with infinite negatives) are the eigenvectors of a normalized similarity kernel. Specifically, if $K(\mathbf{x}, \mathbf{y}) = p(\mathbf{x}, \mathbf{y}) / (p(\mathbf{x}) p(\mathbf{y}))$ is the pointwise mutual information kernel, the optimal $d$-dimensional embedding is given by the top $d$ eigenfunctions of $K$.

(b) This result connects contrastive learning to kernel PCA (Chapter 1) and spectral methods. What practical implication does this connection have for choosing the embedding dimension $d$?

(c) Use this connection to explain why contrastive learning performance improves with larger batch sizes: more negatives better approximate the true PMI kernel.


Exercise 13.22 (***)

Multi-task fine-tuning: instead of fine-tuning a pretrained model for a single task, fine-tune it for multiple tasks simultaneously.

(a) Fine-tune BERT on three tasks simultaneously: sentiment analysis (SST-2), natural language inference (MNLI), and paraphrase detection (QQP). Use a shared backbone with three separate classification heads. Compare with fine-tuning separately for each task.

(b) Implement task-specific LoRA: apply different LoRA adapters for each task while sharing the base model. Compare with: (1) shared LoRA across tasks, (2) full multi-task fine-tuning, (3) single-task fine-tuning.

(c) Multi-task learning can cause negative transfer between tasks (improving one task at the expense of another). Implement gradient surgery (PCGrad; Yu et al., 2020): project each task's gradient onto the normal plane of any conflicting task gradient before applying the update. Does this mitigate negative transfer?


Exercise 13.23 (****)

The scaling laws of transfer learning: How does fine-tuning performance scale with pretrained model size, target dataset size, and compute?

(a) Fine-tune BERT models of increasing size (TinyBERT-4L, DistilBERT-6L, BERT-base-12L, BERT-large-24L) on SST-2 with varying amounts of training data (100, 500, 1,000, 5,000, 25,000 examples). Plot accuracy vs. model size for each data level, and accuracy vs. data size for each model size.

(b) Fit a power law $A = c \cdot N^\alpha \cdot D^\beta$ where $N$ is model size, $D$ is dataset size, and $A$ is accuracy (or 1 - error rate). Estimate $\alpha$ and $\beta$. Is the relationship log-linear? Which exponent is larger — does model size or data size matter more for fine-tuning?

(c) Compare your scaling exponents with the scaling laws reported by Kaplan et al. (2020) for pretraining. Are the scaling laws for fine-tuning similar to those for pretraining? Propose a hypothesis for any observed differences.


Exercise 13.24 (****)

Self-supervised pretraining from scratch: PCRC has 500,000 unlabeled satellite images and 5,000 labeled images. Is it worth pretraining a ViT from scratch on the unlabeled data?

(a) Implement MAE (Masked Autoencoder) pretraining for a ViT-Small on the unlabeled satellite images. Train for 200 epochs with 75% masking ratio.

(b) Compare three strategies for the 5,000 labeled images:

- Fine-tune an ImageNet-pretrained ViT
- Fine-tune the MAE-pretrained ViT (from your satellite pretraining)
- Fine-tune first on ImageNet, then MAE-pretrain on satellite data, then fine-tune on labeled data

Report accuracy for each. Does domain-specific pretraining help?

(c) The MAE pretraining took $X$ GPU-hours. Estimate the break-even point: how many labeled satellite images would PCRC need before the supervised-only approach (fine-tuning ImageNet ViT directly) matches the MAE-pretrained approach? Is the pretraining investment justified?


Exercise 13.25 (****)

Theoretical analysis of fine-tuning as regularized optimization.

(a) Formalize fine-tuning as the optimization problem:

$$\theta^* = \arg\min_\theta \mathcal{L}_{\text{task}}(\theta; \mathcal{D}_T) + \lambda R(\theta, \theta_0)$$

where $\theta_0$ are the pretrained weights and $R$ is a regularizer. Show that L2 regularization $R = \|\theta - \theta_0\|^2$ is equivalent to fine-tuning with weight decay toward pretrained weights (not toward zero).

(b) Derive the optimal solution for the case where $\mathcal{L}_{\text{task}}$ is quadratic (linear regression with pretrained feature extractor). Show that the solution interpolates between the pretrained weights and the task-optimal weights, with $\lambda$ controlling the interpolation.

(c) The Ben-David et al. (2010) bound (Section 13.3) suggests that the target error depends on source error + domain distance + ideal joint error. Reformulate this bound in terms of the fine-tuning optimization problem. Under what conditions on $R$ and $\lambda$ does the fine-tuned model minimize the bound?


Integrated Exercises

Exercise 13.26 (**)

End-to-end two-tower retrieval for a simplified StreamRec.

Generate synthetic data and train the full pipeline:

import numpy as np

def generate_streamrec_data(
    n_users: int = 5000,
    n_items: int = 2000,
    n_interactions: int = 50000,
    user_dim: int = 64,
    item_dim: int = 64,
    seed: int = 42,
) -> dict:
    """Generate synthetic StreamRec interaction data.

    Users and items have latent feature vectors. Interactions are
    generated based on feature similarity + noise.
    """
    rng = np.random.RandomState(seed)
    user_features = rng.randn(n_users, user_dim).astype(np.float32)
    item_features = rng.randn(n_items, item_dim).astype(np.float32)

    # Generate interactions from similarity
    interactions = []
    for _ in range(n_interactions):
        u = rng.randint(n_users)
        similarity = user_features[u] @ item_features.T
        similarity -= similarity.max()  # stabilize the softmax against overflow
        probs = np.exp(similarity) / np.exp(similarity).sum()
        item = rng.choice(n_items, p=probs)
        interactions.append((u, item))

    return {
        "user_features": user_features,
        "item_features": item_features,
        "interactions": np.array(interactions),
    }

(a) Split interactions temporally (first 80% for training, last 20% for testing). Train the TwoTowerModel from Section 13.6. Report HR@10, HR@100, and MRR.

(b) Build a FAISS index for the trained item embeddings. Measure retrieval latency for 1,000 queries with IndexFlatIP vs. IndexIVFFlat. What speedup does the approximate index provide?

(c) Compare the two-tower model with the matrix factorization baseline from Chapter 1 and the MLP from Chapter 6 on the same data split. Which model performs best? At what data scale (increase n_interactions) do the approaches diverge?


Exercise 13.27 (***)

Build a multimodal retrieval system that uses CLIP to jointly embed item images and text descriptions.

(a) For 500 synthetic items, generate both a text description and a corresponding image (use torchvision datasets or generate simple synthetic images). Encode both modalities using CLIP. Compute the cross-modal similarity: does each image's nearest text neighbor match its description?

(b) Implement a fusion strategy: combine the CLIP image embedding and CLIP text embedding for each item into a single representation. Compare three fusion methods:

- Concatenation: $\mathbf{e} = [\mathbf{v}; \mathbf{t}]$
- Average: $\mathbf{e} = (\mathbf{v} + \mathbf{t}) / 2$
- Learned projection: $\mathbf{e} = W[\mathbf{v}; \mathbf{t}] + \mathbf{b}$ (train $W, \mathbf{b}$ with contrastive loss)

Evaluate each on a retrieval task.

(c) A user searches for "relaxing nature documentary with beautiful scenery." Retrieve items using: (1) text-only similarity, (2) image-only similarity, (3) multimodal fusion. Which modality is most useful for this query? Propose a query-adaptive weighting scheme.


Exercise 13.28 (***)

Curriculum fine-tuning: Fine-tune a pretrained model by presenting training examples in order of difficulty, from easiest to hardest.

(a) Define difficulty for the satellite imagery classification task using three metrics:

- Loss-based: Sort examples by loss under the pretrained model (low loss = easy).
- Distance-based: Sort by distance from class centroid in the pretrained feature space.
- Entropy-based: Sort by prediction entropy (low entropy = easy).

(b) Implement curriculum fine-tuning: train for 10 epochs, presenting the easiest 10% of examples in epoch 1, easiest 20% in epoch 2, and so on until all examples are included in epoch 10. Compare with standard fine-tuning (random order) and anti-curriculum (hardest first).
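The linear pacing schedule in part (b) amounts to taking the easiest fraction of examples each epoch. A sketch (the difficulty scores come from whichever metric you chose in part (a); the values below are illustrative):

```python
import numpy as np

def curriculum_subset(difficulty: np.ndarray, epoch: int,
                      total_epochs: int = 10) -> np.ndarray:
    """Indices of the easiest fraction of examples for a given epoch.

    Epoch 1 uses the easiest 10%, epoch 2 the easiest 20%, ..., and the
    final epoch uses all examples.
    """
    frac = epoch / total_epochs
    n_keep = max(1, int(round(frac * len(difficulty))))
    return np.argsort(difficulty)[:n_keep]  # low difficulty = easy

difficulty = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
print(curriculum_subset(difficulty, epoch=2, total_epochs=5).tolist())  # → [1, 3]
```

Anti-curriculum for the comparison in part (b) is the same function with the sort order reversed.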

(c) Does curriculum learning provide more benefit with smaller datasets? Repeat the experiment with 500, 2,000, and 8,000 labeled images. Plot the accuracy gap between curriculum and standard fine-tuning as a function of dataset size.