Chapter 28: Quiz — Multimodal Models and Vision-Language AI
Test your understanding of multimodal models with these questions. Try to answer each question before revealing the solution.
Question 1
How does CLIP's contrastive learning objective create a shared embedding space for images and text?
Show Answer
CLIP trains separate image and text encoders to produce embeddings in a shared $d$-dimensional space. The contrastive objective maximizes the cosine similarity between matched image-text pairs while minimizing similarity between unmatched pairs within each batch. For a batch of $N$ pairs, the loss treats each image as a query that should match its paired text (image-to-text softmax cross-entropy) and each text as a query that should match its paired image (text-to-image softmax cross-entropy). Over training on 400M pairs, the encoders learn to map semantically related concepts from both modalities to nearby points in the shared space, enabling direct comparison of images and text through cosine similarity.
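As a concrete illustration, here is a minimal PyTorch sketch of the symmetric contrastive loss described above; the encoder outputs and the temperature value are placeholders, not CLIP's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs.

    image_emb, text_emb: (N, d) tensors from the two encoders (placeholder inputs).
    """
    # Project onto the unit sphere so dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) similarity matrix; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matched pair for row/column i sits on the diagonal.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```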
Question 2
Why does CLIP achieve strong zero-shot classification without seeing any labeled examples from the target dataset?
Show Answer
CLIP achieves zero-shot classification because its contrastive pre-training on 400M diverse image-text pairs from the internet exposes it to an enormous vocabulary of visual concepts described in natural language. At inference time, class names are converted to text prompts ("a photo of a {class}"), encoded by the text encoder, and compared to the image embedding via cosine similarity. The class with the highest similarity is predicted. This works because: (1) the shared embedding space already contains representations for most common visual concepts, (2) the natural language interface is flexible — any concept expressible in text can serve as a classifier, and (3) the training data's diversity covers a wide range of domains and categories that overlap with common benchmarks.
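A minimal zero-shot classification sketch using the Hugging Face `transformers` CLIP interface; the class names, prompt template, and image path are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative label set; any concept expressible in text can serve as a class.
class_names = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {c}" for c in class_names]

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled image-text similarities; softmax gives class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```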
Question 3
What are the limitations of CLIP for tasks requiring compositional understanding?
Show Answer
CLIP struggles with compositional understanding in several ways:
- **Attribute binding**: CLIP often fails to correctly associate attributes with objects — "a red cube and a blue sphere" may be confused with "a blue cube and a red sphere."
- **Counting**: CLIP cannot reliably count objects.
- **Spatial relationships**: Understanding "above," "below," and "left of" is weak.
- **Negation**: CLIP handles "no" or "not" poorly in text.
- **Order sensitivity**: The bag-of-words-like behavior means "dog chases cat" and "cat chases dog" may produce similar embeddings.

These limitations arise because the contrastive objective operates on global embeddings that compress all information into a single vector, losing fine-grained structural details. Benchmarks like Winoground and ARO specifically test these compositional failures.
Question 4
Describe the two-stage training process of LLaVA and explain why each stage is necessary.
Show Answer
**Stage 1: Feature Alignment Pre-training** trains only the projection layer (linear or MLP) that maps CLIP visual features to the LLM's embedding space, using 595K image-caption pairs. Both the vision encoder and LLM are frozen. This stage is necessary to establish a "translation" between the visual feature space and the text token space — without it, the LLM would receive unintelligible inputs from the vision encoder.

**Stage 2: Visual Instruction Tuning** unfreezes the projection layer and the LLM (the vision encoder stays frozen), training on 158K multimodal instruction-following examples. This stage is necessary to teach the model to follow complex multimodal instructions — answering questions, describing details, engaging in multi-turn conversation about images. Stage 1 alone only teaches captioning; Stage 2 teaches instruction following, reasoning, and conversational behavior.
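The sketch below is a schematic of the two-stage freezing schedule, using hypothetical placeholder modules standing in for the CLIP vision tower, the projector, and the LLM; the dimensions are illustrative, not LLaVA's actual code.

```python
import torch.nn as nn

# Placeholder modules; real LLaVA uses a CLIP ViT-L/14 vision tower, an MLP projector,
# and a Vicuna-class LLM. Dimensions here are illustrative only.
vision_encoder = nn.Linear(768, 1024)   # stands in for the frozen CLIP image encoder
projector = nn.Linear(1024, 4096)       # maps visual features into the LLM embedding space
llm = nn.Linear(4096, 4096)             # stands in for the language model

def set_trainable(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: feature alignment on image-caption pairs — train the projector only.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2: visual instruction tuning — unfreeze the LLM, keep the vision encoder frozen.
set_trainable(llm, True)
```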
Question 5
How does the Q-Former in BLIP-2 bridge a frozen image encoder and a frozen LLM?
Show Answer
The Q-Former is a lightweight transformer with a set of 32 learnable query tokens. It operates as a bottleneck between the vision encoder and the LLM. The queries cross-attend to the frozen image encoder's output features, extracting the most relevant visual information into a fixed set of 32 visual tokens. These compressed visual tokens are then projected to the LLM's input dimension and prepended to the text tokens as a visual prefix. The Q-Former is the only component trained, with only ~188M parameters. This design is parameter-efficient (the much larger image encoder and LLM remain frozen) and computationally efficient (32 visual tokens are far fewer than the 257 tokens produced by a ViT-G, reducing the LLM's input length and attention cost).
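A minimal sketch of the bottleneck idea: a set of 32 learnable queries cross-attends to frozen image features and is projected to the LLM width. Dimensions are illustrative; the real Q-Former also interleaves self-attention blocks and a text-interaction path.

```python
import torch
import torch.nn as nn

class QueryBottleneck(nn.Module):
    """32 learnable queries cross-attend to frozen image features (illustrative sizes)."""

    def __init__(self, num_queries=32, dim=768, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
        self.to_llm = nn.Linear(dim, llm_dim)   # projects compressed tokens to the LLM width

    def forward(self, image_feats):             # image_feats: (B, 257, dim) from a frozen ViT
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        visual_tokens, _ = self.cross_attn(q, image_feats, image_feats)
        return self.to_llm(visual_tokens)        # (B, 32, llm_dim) visual prefix for the LLM
```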
Question 6
Compare the approaches of LLaVA and Flamingo for integrating vision into language models.
Show Answer
**LLaVA** uses a simple approach: visual tokens from CLIP are projected to the LLM's embedding space and concatenated with text tokens in the input sequence. The LLM processes all tokens uniformly through its standard self-attention. Training unfreezes the projection layer and the LLM. **Flamingo** integrates vision at the architectural level: visual tokens from a Perceiver Resampler (which compresses image features into 64 tokens) are injected into the frozen LLM through newly added gated cross-attention layers interleaved between existing LLM layers. Only the Perceiver Resampler and the cross-attention layers are trained; the LLM remains frozen. Key differences: (1) Flamingo keeps the LLM frozen (preserving language capabilities), while LLaVA fine-tunes it. (2) Flamingo uses cross-attention (visual tokens stay separate from text tokens), while LLaVA concatenates them. (3) Flamingo natively handles interleaved image-text sequences for few-shot learning, while LLaVA processes single images.
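For the Flamingo side, here is a minimal sketch of a tanh-gated cross-attention block with a zero-initialized gate, so the frozen LM's behavior is unchanged at the start of training; dimensions are illustrative, and the real block also includes a gated feed-forward layer.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Inserted between frozen LM layers; acts as an identity at init because tanh(0) = 0."""

    def __init__(self, dim=4096, num_heads=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # learnable scalar gate, zero-initialized

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: (B, T, dim) from the frozen LM; visual_tokens: (B, 64, dim) from the Resampler.
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended
```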
Question 7
What is the role of the temperature parameter in CLIP's contrastive loss?
Show Answer
The temperature parameter $\tau$ scales the cosine similarity scores before the softmax: $S_{ij} = \cos(\mathbf{v}_i, \mathbf{u}_j) / \tau$. A lower temperature sharpens the softmax distribution, making the model more confident about correct matches and more harshly penalizing incorrect matches. A higher temperature produces softer distributions, treating all pairs more equally. CLIP initializes $\tau = 0.07$ and learns it during training. The learned temperature typically converges to a small value (~0.01), indicating that the model benefits from sharp similarity distributions. This is important because without proper temperature scaling, all cosine similarities (which range from -1 to 1) would produce a nearly uniform softmax distribution, making the contrastive loss ineffective.
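A small numerical illustration of how the temperature sharpens the softmax; the similarity values are made up.

```python
import torch

# Cosine similarities of one image against a batch of texts (made-up values in [-1, 1]).
sims = torch.tensor([0.31, 0.28, 0.05, -0.10])

for tau in (1.0, 0.07, 0.01):
    probs = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau}: {[round(p, 3) for p in probs.tolist()]}")

# tau=1.0 yields a nearly uniform distribution; tau=0.01 makes the best match dominate.
```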
Question 8
How does ImageBind achieve alignment between modalities that were never directly paired during training?
Show Answer
ImageBind leverages images as a "binding" modality. During training, each non-image modality (audio, depth, thermal, IMU, text) is paired with images using CLIP-style contrastive learning. For example, audio-image pairs train the audio encoder, and depth-image pairs train the depth encoder. Because all modalities are aligned to images (and images are aligned to text through CLIP), alignment between other modality pairs emerges transitively. If "dog barking" audio is close to "dog" images, and "dog" images are close to "dog" text, then "dog barking" audio will be close to "dog" text — even though audio and text were never directly paired. This works because the shared image embedding space acts as a hub, creating an implicit alignment between all modalities through their shared relationship to images.
Question 9
What metrics are used to evaluate image captioning, and what are their strengths and weaknesses?
Show Answer
- **BLEU**: Measures n-gram precision between generated and reference captions. Strengths: fast, widely used. Weaknesses: ignores recall; penalizes valid paraphrases; does not capture semantic meaning.
- **METEOR**: Uses synonym matching and stemming. Strengths: better handles paraphrasing than BLEU. Weaknesses: still largely lexical; relies on WordNet.
- **CIDEr**: TF-IDF weighted n-gram similarity, designed for captioning. Strengths: emphasizes informative words; uses multiple references well. Weaknesses: can be gamed by including rare words.
- **SPICE**: Parses captions into scene graphs and compares semantic content. Strengths: evaluates semantic correctness (objects, attributes, relations). Weaknesses: computationally expensive; limited by parser accuracy.
- **CLIPScore**: Uses CLIP to measure image-caption alignment directly. Strengths: reference-free; captures semantic similarity. Weaknesses: inherits CLIP's limitations (compositionality, counting).

No single metric is sufficient; best practice is to report multiple metrics and include human evaluation.
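As an illustration of the reference-free option, here is a minimal CLIPScore sketch (2.5 × max(cosine similarity, 0)) using the Hugging Face `transformers` CLIP interface; the image path and caption are placeholders.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path, caption):
    """Reference-free CLIPScore: 2.5 * max(cosine similarity, 0)."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    cos = F.cosine_similarity(out.image_embeds, out.text_embeds).item()
    return 2.5 * max(cos, 0.0)

print(clip_score("example.jpg", "a dog catching a frisbee"))  # placeholder inputs
```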
Question 10
Explain how a multimodal retrieval system using CLIP and FAISS works end-to-end.
Show Answer
**Indexing phase**: (1) Load all images in the database. (2) Process each image through CLIP's image encoder to produce a normalized embedding vector. (3) Collect all embeddings into a matrix and build a FAISS index (e.g., IndexFlatIP for exact search or IndexIVFPQ for approximate search). The index enables fast nearest-neighbor lookup.

**Query phase**: (1) Receive a text query from the user. (2) Encode the query using CLIP's text encoder to produce a normalized embedding. (3) Search the FAISS index for the $K$ nearest neighbors by inner product (equivalent to cosine similarity for normalized vectors). (4) Return the corresponding images ranked by similarity score. For image-to-image search, the query image is encoded with the image encoder instead of the text encoder. For hybrid queries (image + text modification), embeddings can be combined through addition or more sophisticated fusion methods.
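A minimal end-to-end sketch with FAISS; the embeddings here are random stand-ins for the L2-normalized vectors a CLIP encoder would produce.

```python
import faiss
import numpy as np

# Stand-in embeddings: in practice these come from CLIP's image encoder,
# L2-normalized so that inner product equals cosine similarity.
d, num_images = 512, 10_000
image_embeddings = np.random.randn(num_images, d).astype("float32")
image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)

# Indexing phase: exact inner-product search (IndexIVFPQ is an option for large collections).
index = faiss.IndexFlatIP(d)
index.add(image_embeddings)

# Query phase: encode the text query the same way, then take the top-K neighbors.
query = np.random.randn(1, d).astype("float32")   # stand-in for a CLIP text embedding
query /= np.linalg.norm(query, axis=1, keepdims=True)
scores, ids = index.search(query, 5)               # similarity scores and image indices
```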
Question 11
Why is contrastive learning with large batch sizes critical for CLIP's performance?
Show Answer
Large batch sizes are critical because the contrastive loss treats all non-matching pairs in the batch as negative examples. With batch size $N$, each positive pair has $N-1$ negative examples. Larger batches provide more diverse and challenging negatives, preventing the model from taking shortcuts (e.g., distinguishing images and texts based on superficial features that happen to not appear in a small batch). With batch size 256, the model sees 255 negatives — many of which may be easy to distinguish. With batch size 32,768, the model sees 32,767 negatives, which are much more likely to include hard negatives that require genuine semantic understanding. Empirically, CLIP's zero-shot performance degrades significantly with smaller batch sizes. Techniques like memory banks, momentum encoders (MoCo), and gradient caching can partially mitigate this by maintaining a large pool of negatives without requiring enormous batch sizes.
Question 12
What challenges does visual question answering pose beyond image captioning?
Show Answer
VQA poses several additional challenges:
- **Targeted understanding**: While captioning requires general description, VQA demands answering specific questions that may focus on small details, attributes, or relationships.
- **Counting**: "How many?" questions require precise enumeration.
- **Spatial reasoning**: Questions about relative positions ("What is to the left of the dog?") require spatial understanding.
- **World knowledge**: OK-VQA questions ("What country's flag is this?") require knowledge beyond the image.
- **Reading comprehension**: TextVQA requires OCR capabilities to read text in images.
- **Multi-step reasoning**: Some questions require chaining multiple observations ("Is the person who is standing taller than the person who is sitting?").
- **Language bias**: Models can learn to answer based on question patterns alone (e.g., "What color is the banana?" -> "yellow") without truly understanding the image.

Question 13
How does SigLIP improve upon CLIP's training efficiency?