Chapter 28: Exercises — Multimodal Models and Vision-Language AI

Conceptual Exercises

Exercise 1: CLIP Contrastive Loss

For a batch of $N = 4$ image-text pairs, the cosine similarity matrix is:

$$S = \begin{bmatrix} 0.9 & 0.2 & 0.1 & 0.3 \\ 0.1 & 0.8 & 0.4 & 0.2 \\ 0.3 & 0.3 & 0.7 & 0.1 \\ 0.2 & 0.1 & 0.2 & 0.85 \end{bmatrix}$$

Compute the image-to-text loss $\mathcal{L}_{\text{image}}$, where row $i$ of $S$ holds the similarities of image $i$ to all $N$ texts and the diagonal entries correspond to the matched pairs. Assume temperature $\tau = 1$.
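
For reference, the image-to-text direction of CLIP's symmetric contrastive loss is the row-wise cross-entropy of $S/\tau$, with the diagonal entry as the correct match in each row:

$$\mathcal{L}_{\text{image}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(S_{ii}/\tau)}{\sum_{j=1}^{N} \exp(S_{ij}/\tau)}$$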

Exercise 2: Zero-Shot Classification Mechanism

Explain why CLIP's zero-shot classification works without any training on the target classes. How does the contrastive pre-training objective naturally enable this capability? What role does prompt engineering play?

Exercise 3: Batch Size in Contrastive Learning

Why does CLIP require a very large batch size (32,768 image-text pairs) during training? What happens to the quality of the learned representations if you train with a batch size of 256 instead? Propose a method to mitigate the effect of smaller batch sizes.

Exercise 4: LLaVA Architecture Analysis

LLaVA connects a CLIP ViT-L/14 (1024-dim) to Vicuna-7B (4096-dim) with a linear projection. (a) What is the shape of the projection matrix? (b) How many visual tokens does the LLM receive? (c) What is the total sequence length if the text prompt has 50 tokens? (d) Why is the vision encoder frozen during Stage 2 training?

Exercise 5: BLIP-2 Q-Former

Explain the role of the Q-Former in BLIP-2. Why is a fixed number of learned queries (32) used rather than passing all image patch features to the LLM? What are the computational implications?

Exercise 6: Flamingo's Gated Cross-Attention

In Flamingo, the output of each gated cross-attention layer is $\tanh(\alpha) \cdot \text{CrossAttn}(\mathbf{x}, \mathbf{v})$, where $\alpha$ is a learnable scalar initialized to zero. (a) What is the output at initialization? (b) Why is this initialization important? (c) How does this differ from ControlNet's zero-initialization strategy?

Exercise 7: Multimodal Embedding Arithmetic

Given CLIP embeddings, explain how you would implement: (a) text-to-image retrieval, (b) image-to-image retrieval, (c) "image + text offset" retrieval (e.g., an image of a cat + "wearing a hat"). What limitations does embedding arithmetic have?

Exercise 8: VQA Challenge Analysis

Consider the question "How many red cars are in the image?" about a scene with 3 red cars, 2 blue cars, and 1 red truck. (a) Why is this question challenging for CLIP-based zero-shot approaches? (b) How would LLaVA handle this differently? (c) What visual reasoning capabilities are required?

Exercise 9: Contrastive vs. Generative Objectives

Compare the contrastive objective (CLIP) with the generative objective (image captioning). What does each objective learn well? What does each fail to capture? How does BLIP combine both for improved representations?

Exercise 10: ImageBind Emergent Alignment

ImageBind aligns six modalities by pairing each non-image modality with images, but audio and depth are never directly paired with each other during training. Explain how alignment between audio and depth emerges. Under what conditions might this transitive alignment fail?

Implementation Exercises

Exercise 11: CLIP Feature Extraction

Write a PyTorch script that loads CLIP ViT-B/32 and extracts image and text embeddings. Compute the similarity matrix for 5 images and 5 text descriptions, and visualize it as a heatmap.
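
A minimal starting point, assuming the Hugging Face transformers and matplotlib packages; the image paths and text descriptions below are placeholders to replace with your own:

```python
import torch
import matplotlib.pyplot as plt
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load CLIP ViT-B/32 and its paired preprocessor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder inputs: substitute your own 5 images and 5 descriptions.
image_paths = [f"image_{i}.jpg" for i in range(5)]
texts = ["a dog", "a cat", "a car", "a mountain", "a plate of food"]

images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize so the dot product is cosine similarity, then build the 5x5 matrix.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T

# Visualize as a heatmap: rows are images, columns are texts.
plt.imshow(similarity.numpy(), cmap="viridis")
plt.xticks(range(len(texts)), texts, rotation=45, ha="right")
plt.yticks(range(len(image_paths)), image_paths)
plt.colorbar(label="cosine similarity")
plt.tight_layout()
plt.show()
```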

Exercise 12: Zero-Shot Classifier

Implement a zero-shot image classifier using CLIP. Test on CIFAR-10 with multiple prompt templates ("a photo of a {class}", "a {class}", "an image of a {class}") and compare accuracies.
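
One possible skeleton, assuming torchvision's CIFAR-10 loader and the same transformers CLIP checkpoint as above; the template string is the part to vary when comparing accuracies:

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from torchvision.datasets import CIFAR10

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

dataset = CIFAR10(root="./data", train=False, download=True)
classes = dataset.classes
template = "a photo of a {}"  # swap in the other templates to compare

# One text embedding per class from the chosen template.
prompts = [template.format(c) for c in classes]
text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

correct = 0
n_eval = 1000  # subsample for speed; use the full test set for final numbers
for i in range(n_eval):
    image, label = dataset[i]  # PIL image, integer label
    image_inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(**image_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    pred = (img_emb @ text_emb.T).argmax(dim=-1).item()
    correct += int(pred == label)

print(f"Zero-shot accuracy ({template!r}): {correct / n_eval:.3f}")
```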

Exercise 13: Prompt Ensemble for Classification

Implement prompt ensembling for zero-shot classification: average the text embeddings from multiple prompt templates for each class. Measure the accuracy improvement over a single template on CIFAR-100.
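
The key change relative to Exercise 12 is small: embed each class with every template, average the embeddings, and renormalize. A sketch of that step, reusing the `model` and `processor` objects from the Exercise 12 sketch:

```python
import torch

templates = ["a photo of a {}", "a {}", "an image of a {}"]

def class_embeddings(classes, model, processor):
    """Average normalized text embeddings over all templates for each class."""
    all_emb = []
    for c in classes:
        prompts = [t.format(c) for t in templates]
        inputs = processor(text=prompts, return_tensors="pt", padding=True)
        with torch.no_grad():
            emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean_emb = emb.mean(dim=0)
        all_emb.append(mean_emb / mean_emb.norm())  # renormalize the ensemble
    return torch.stack(all_emb)  # shape: (num_classes, embed_dim)
```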

Exercise 14: Image Captioning Pipeline

Build an image captioning pipeline using BLIP-2 from Hugging Face. Caption 20 images from different domains and evaluate quality using both CLIPScore and manual inspection.
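
A minimal captioning loop using the Hugging Face BLIP-2 OPT-2.7B checkpoint; the image paths are placeholders, and the `.to(device, dtype)` cast assumes a recent transformers version:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image_paths = [f"images/sample_{i}.jpg" for i in range(20)]  # placeholder paths

for path in image_paths:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    generated_ids = model.generate(**inputs, max_new_tokens=30)
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    print(f"{path}: {caption}")
```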

Exercise 15: Visual Question Answering

Implement a VQA pipeline using BLIP-2 or LLaVA. Test on at least 20 diverse questions spanning counting, color, spatial relationships, and general knowledge.
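
With the BLIP-2 OPT checkpoints, VQA is usually prompted in a "Question: ... Answer:" format. A minimal snippet building on the Exercise 14 sketch (the image path is a placeholder, and `processor`, `model`, `device`, and `dtype` are reused from there):

```python
from PIL import Image

image = Image.open("images/street.jpg").convert("RGB")  # placeholder path
question = "How many red cars are in the image?"
prompt = f"Question: {question} Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```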

Exercise 16: CLIP Image Search Engine

Build a simple image search engine: encode a folder of 100+ images with CLIP, store embeddings in a NumPy array, and implement text-to-image and image-to-image search with cosine similarity.
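
One way to structure this, again with the transformers CLIP checkpoint; the folder name is a placeholder, and the skip-first-result heuristic assumes the query image is itself in the index:

```python
import numpy as np
import torch
from pathlib import Path
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_dir = Path("my_images")  # placeholder folder of 100+ images
paths = sorted(image_dir.glob("*.jpg"))

def embed_images(image_paths, batch_size=16):
    """Encode images in batches and return L2-normalized embeddings."""
    chunks = []
    for i in range(0, len(image_paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in image_paths[i:i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(**inputs)
        chunks.append((emb / emb.norm(dim=-1, keepdim=True)).numpy())
    return np.concatenate(chunks)

image_embeddings = embed_images(paths)  # shape: (num_images, 512) for ViT-B/32

def search_by_text(query, k=5):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = (q / q.norm(dim=-1, keepdim=True)).numpy()[0]
    scores = image_embeddings @ q  # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores)[:k]
    return [(paths[i].name, float(scores[i])) for i in top]

def search_by_image(query_path, k=5):
    q = embed_images([Path(query_path)])[0]
    scores = image_embeddings @ q
    top = np.argsort(-scores)[1:k + 1]  # skip rank 1 if the query is in the index
    return [(paths[i].name, float(scores[i])) for i in top]

print(search_by_text("a red sports car"))
```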

Exercise 17: FAISS Index Building

Extend Exercise 16 by storing CLIP embeddings in a FAISS index. Compare search speed and accuracy between a flat index (exact) and an IVF index (approximate) for 10,000+ images.
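
A sketch of the two index types, assuming `image_embeddings` is the array of L2-normalized CLIP embeddings built in Exercise 16:

```python
import numpy as np
import faiss

# FAISS expects contiguous float32 arrays.
xb = np.ascontiguousarray(image_embeddings, dtype="float32")
d = xb.shape[1]  # embedding dimension, e.g. 512 for ViT-B/32

# Exact search: inner product on L2-normalized vectors equals cosine similarity.
flat_index = faiss.IndexFlatIP(d)
flat_index.add(xb)

# Approximate search: inverted-file index with 100 clusters (tune nlist to corpus size).
nlist = 100
quantizer = faiss.IndexFlatIP(d)
ivf_index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf_index.train(xb)      # IVF indexes must be trained before vectors are added
ivf_index.add(xb)
ivf_index.nprobe = 10    # clusters visited per query; trades speed for recall

query = xb[:1]           # placeholder: use a text or image query embedding in practice
scores_exact, ids_exact = flat_index.search(query, 10)
scores_approx, ids_approx = ivf_index.search(query, 10)
```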

Exercise 18: Cross-Modal Retrieval Evaluation

Evaluate your retrieval system using Recall@K (K=1, 5, 10) on the Flickr30k dataset. Report both text-to-image and image-to-text retrieval performance.
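
Recall@K itself is straightforward once you have the full query-gallery similarity matrix. A sketch assuming one ground-truth match per query; Flickr30k has five captions per image, so adapt the ground-truth check accordingly:

```python
import numpy as np

def recall_at_k(similarity, gt_indices, ks=(1, 5, 10)):
    """similarity: (num_queries, num_gallery); gt_indices[i] is the correct gallery index for query i."""
    ranked = np.argsort(-similarity, axis=1)  # gallery indices sorted by descending score
    results = {}
    for k in ks:
        hits = [gt_indices[i] in ranked[i, :k] for i in range(len(gt_indices))]
        results[f"R@{k}"] = float(np.mean(hits))
    return results
```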

Exercise 19: LLaVA Conversation Interface

Build an interactive conversation interface for LLaVA where users can upload an image and ask multiple follow-up questions. Implement conversation history management.

Exercise 20: Multimodal Embedding Visualization

Extract CLIP embeddings for 1000 images from 10 categories and their text descriptions. Visualize both image and text embeddings in 2D using t-SNE, coloring by category.
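
A plotting sketch with scikit-learn's t-SNE, assuming `image_embeddings` and `text_embeddings` are (1000, 512) arrays extracted as in the earlier exercises and `labels` is the integer category per item:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project image and text embeddings into a shared 2D layout.
combined = np.concatenate([image_embeddings, text_embeddings])
coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(combined)

n = len(image_embeddings)
plt.scatter(coords[:n, 0], coords[:n, 1], c=labels, marker="o", s=8, label="images")
plt.scatter(coords[n:, 0], coords[n:, 1], c=labels, marker="^", s=8, label="texts")
plt.legend()
plt.title("CLIP image and text embeddings (t-SNE)")
plt.show()
```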

Applied Exercises

Exercise 21: Product Search System

Build a product search system: index a catalog of product images with CLIP, allow text queries ("red running shoes"), and return ranked results. Implement filtering by category metadata.

Exercise 22: Image Moderation

Use CLIP zero-shot classification to build an image content moderation system. Define categories (safe, suggestive, violent, etc.) and classify a test set of images. Measure precision and recall.

Exercise 23: Medical Image QA

Fine-tune a VQA model on a medical image dataset (e.g., PathVQA). Compare zero-shot performance with fine-tuned performance and analyze failure cases.

Exercise 24: Multimodal RAG

Build a Retrieval-Augmented Generation system for images: given a text query, retrieve the top-3 most relevant images from a database and generate a summary using a multimodal LLM.

Exercise 25: CLIP Linear Probe

Train a linear classifier on frozen CLIP features for a custom dataset (at least 5 classes, 50+ images per class). Compare accuracy across (a) CLIP zero-shot, (b) the linear probe, and (c) end-to-end fine-tuned CLIP.
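
The linear probe reduces to logistic regression over the frozen embeddings. A scikit-learn sketch, assuming `train_features`/`test_features` are extracted CLIP image embeddings and `train_labels`/`test_labels` the corresponding class indices:

```python
from sklearn.linear_model import LogisticRegression

# Frozen CLIP features in, class predictions out: this is the linear probe.
clf = LogisticRegression(max_iter=1000, C=1.0)  # consider sweeping C; results are sensitive to it
clf.fit(train_features, train_labels)
print("Linear-probe accuracy:", clf.score(test_features, test_labels))
```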

Challenge Exercises

Exercise 26: CLIP Training from Scratch

Implement CLIP's contrastive training objective and train a small CLIP model (ResNet-18 + 4-layer transformer) on a subset of Conceptual Captions (100K pairs). Evaluate zero-shot performance on CIFAR-10.
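
The heart of the training loop is the symmetric contrastive loss. A minimal sketch, assuming `image_features` and `text_features` are the batch's projected embeddings and `logit_scale` is a learnable log-temperature parameter, as in the original CLIP:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    # Normalize so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix scaled by the learned temperature.
    logits_per_image = logit_scale.exp() * image_features @ text_features.T
    logits_per_text = logits_per_image.T

    # The i-th image matches the i-th text: targets are the diagonal indices.
    targets = torch.arange(image_features.size(0), device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2
```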

Exercise 27: Custom Multimodal Model

Build a simplified LLaVA-style model: connect a pre-trained CLIP image encoder to a small GPT-2 model through a learned projection. Train on a captioning dataset and evaluate.

Exercise 28: Grounded VQA

Implement a VQA system that not only answers questions but also highlights the image region relevant to the answer using attention visualization. Evaluate on GQA.

Exercise 29: Cross-Lingual Image Retrieval

Using a multilingual CLIP model, build a system that can retrieve images using queries in multiple languages. Test on English, Spanish, and Chinese queries and compare retrieval quality.

Exercise 30: Multimodal Few-Shot Learning

Implement a few-shot learning system that can classify new image categories from just 1-5 examples, using CLIP embeddings and a simple nearest-neighbor or prototypical approach.
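
A prototypical-style baseline is only a few lines on top of CLIP embeddings. A sketch, assuming `support_features` has shape (num_classes, k_shot, dim) and `query_features` has shape (num_queries, dim), both already L2-normalized:

```python
import numpy as np

def prototype_classify(support_features, query_features):
    """Assign each query to the class whose prototype it is most similar to."""
    # One prototype per class: the renormalized mean of its support embeddings.
    prototypes = support_features.mean(axis=1)
    prototypes /= np.linalg.norm(prototypes, axis=-1, keepdims=True)
    scores = query_features @ prototypes.T  # cosine similarity to each prototype
    return scores.argmax(axis=-1)
```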