Chapter 28: Multimodal Models and Vision-Language AI

"The world is not made of text alone." — A guiding principle for AI engineering

Humans perceive and reason about the world through multiple sensory channels simultaneously — we see an image, read its caption, and understand the relationship between them effortlessly. For decades, computer vision and natural language processing developed as separate disciplines with distinct architectures, datasets, and evaluation protocols. The emergence of multimodal models has shattered this barrier, creating systems that jointly understand images and text, answer questions about visual content, and retrieve relevant images from natural language queries.

This chapter explores the foundations and frontiers of vision-language AI. We begin with CLIP, the model that demonstrated the power of contrastive learning across modalities, then examine architectures for image captioning and visual question answering (VQA), and culminate with modern multimodal large language models like LLaVA and Flamingo. By the end, you will understand how to build systems that bridge the gap between seeing and speaking, with practical implementations in PyTorch and Hugging Face.

The core technical ideas in this chapter build on foundations from earlier parts of the book. The vision encoders rely on the transformer architectures covered in Chapter 26, the language decoders use the autoregressive generation techniques from Chapters 7-9, and the contrastive learning objectives share deep connections with the embedding methods discussed in Chapter 12. Multimodal AI also feeds forward into later chapters: the audio-language alignment in Chapter 29 follows the same contrastive blueprint as CLIP, and the video-language models in Chapter 30 extend these ideas to the temporal domain.


28.1 The Multimodal Challenge

28.1.1 Why Multimodality Matters

Single-modality models are inherently limited. A text-only model cannot verify whether a claim about an image is true. A vision-only model cannot explain what it sees in natural language. Real-world AI applications — from medical image analysis with clinical reports to autonomous driving with voice commands — require understanding that spans modalities.

The core technical challenge is alignment: learning a shared representation space where semantically related concepts from different modalities are close together. A picture of a golden retriever and the text "a golden retriever playing in the park" should map to nearby points in this shared space, despite being radically different data types.

28.1.2 The Alignment Problem

The central technical challenge of multimodal AI is alignment: learning correspondences between concepts expressed in different modalities. Alignment operates at multiple granularities:

  • Global alignment: The overall meaning of an image matches the overall meaning of its caption. CLIP operates at this level.
  • Regional alignment: Specific image regions correspond to specific words or phrases. "The red car" should activate features near the red car in the image. Grounding models operate at this level.
  • Temporal alignment: In video-text settings, specific video segments correspond to specific sentences. Whisper timestamps (Chapter 29) operate at this level.

The difficulty increases with granularity. Global alignment can be learned with contrastive objectives; regional alignment requires attention mechanisms or detection heads; temporal alignment requires sequence modeling.

28.1.3 Historical Approaches

Early multimodal systems used a two-stage approach:

  1. Extract features independently from each modality using pre-trained models (e.g., ResNet for images, BERT for text).
  2. Fuse features using a learned fusion module (concatenation, bilinear pooling, or attention).

This approach suffered from a fundamental problem: the visual and textual features were trained on different objectives and existed in incompatible representation spaces. Alignment between modalities had to be learned entirely by the fusion module, which was typically trained on small, task-specific datasets.

Specific fusion strategies included:

  • Early fusion: Concatenate raw features before processing. Simple, but requires the model to learn alignment from scratch.
  • Late fusion: Process each modality independently and combine only the final predictions. Misses cross-modal interactions.
  • Cross-attention fusion: Use attention to dynamically weight features from one modality based on the other. Most flexible but computationally expensive.

The breakthrough came when researchers realized that if the features from both modalities were pre-aligned — trained to exist in the same embedding space — the fusion problem became trivially easy.

28.1.4 The Contrastive Learning Revolution

Before CLIP, the dominant approach to visual representation learning was supervised classification on ImageNet. A model trained to classify 1,000 ImageNet categories learned features that transferred well to other vision tasks, but these features had no inherent connection to language. CLIP changed this fundamentally by training on a different objective: rather than predicting fixed categories, the model learned to predict which text caption belonged to which image. This seemingly simple change had profound consequences.

CLIP (Radford et al., 2021) demonstrated that pre-training on hundreds of millions of image-text pairs with a contrastive objective produces representations that are aligned by construction. This eliminated the need for task-specific fusion and enabled zero-shot transfer to a wide range of vision tasks. The key insight was that natural language provides a much richer supervisory signal than class labels — a caption like "a golden retriever playing fetch on a sunny beach" contains information about the object (golden retriever), the action (playing fetch), the setting (beach), and the conditions (sunny) that no single class label could capture.

This approach also naturally scales with the diversity of internet data. While ImageNet has 1,000 categories, the web contains descriptions of millions of visual concepts. By learning from this diversity, CLIP acquired visual understanding that generalized far beyond any fixed category set.


28.2 CLIP: Contrastive Language-Image Pre-training

CLIP (Contrastive Language-Image Pre-training) is arguably the most influential multimodal model of the 2020s. Its influence extends far beyond its original purpose of image classification: CLIP embeddings power text-to-image generation in Stable Diffusion (Chapter 27), serve as the visual backbone in LLaVA and other multimodal LLMs, enable open-vocabulary object detection, and provide the foundation for audio-text models like CLAP (Chapter 29) and video-text models like CLIP4Clip (Chapter 30). Understanding CLIP deeply is therefore essential for understanding the entire modern multimodal AI ecosystem.

28.2.1 Architecture Overview

CLIP consists of two encoders trained jointly:

  1. Image encoder: Either a Vision Transformer (ViT-B/32, ViT-B/16, ViT-L/14) or a modified ResNet (RN50, RN101, RN50x4, RN50x16, RN50x64). The image encoder produces a fixed-dimensional vector $\mathbf{v} = f_{\text{image}}(\mathbf{x}) \in \mathbb{R}^d$.

  2. Text encoder: A 12-layer Transformer with masked self-attention (GPT-2-style). The text encoder produces a fixed-dimensional vector $\mathbf{u} = f_{\text{text}}(\mathbf{t}) \in \mathbb{R}^d$.

Both encoders project their outputs to a shared embedding space of dimension $d$ (512 for the ViT-B variants, 768 for ViT-L; the ResNet variants use other widths, such as 1024 for RN50) through learned linear projections followed by L2 normalization. The normalization ensures that all embeddings lie on a unit hypersphere, making cosine similarity equivalent to dot product and enabling efficient similarity computation.

28.2.2 The Contrastive Objective

CLIP is trained on a batch of $N$ image-text pairs. Within each batch, there are $N$ correct pairings and $N^2 - N$ incorrect pairings. The training objective maximizes the cosine similarity of correct pairs while minimizing the similarity of incorrect pairs.

Given a batch of image embeddings $\{\mathbf{v}_i\}_{i=1}^N$ and text embeddings $\{\mathbf{u}_j\}_{j=1}^N$, the similarity matrix is:

$$S_{ij} = \tau \cdot \cos(\mathbf{v}_i, \mathbf{u}_j) = \tau \cdot \frac{\mathbf{v}_i \cdot \mathbf{u}_j}{\|\mathbf{v}_i\| \|\mathbf{u}_j\|}$$

where $\tau$ is a learned inverse-temperature (logit scale) parameter, initialized to $1/0.07 \approx 14.3$ and updated during training.

The loss is a symmetric cross-entropy over the similarity matrix:

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{\text{image}} + \mathcal{L}_{\text{text}}\right)$$

where:

$$\mathcal{L}_{\text{image}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ij})}$$

$$\mathcal{L}_{\text{text}} = -\frac{1}{N}\sum_{j=1}^{N} \log \frac{\exp(S_{jj})}{\sum_{i=1}^{N}\exp(S_{ij})}$$

$\mathcal{L}_{\text{image}}$ treats each image as a query and its paired text as the positive, while $\mathcal{L}_{\text{text}}$ treats each text as a query and its paired image as the positive.
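
In code, the symmetric objective reduces to two cross-entropy terms over the same similarity matrix. The following is a minimal PyTorch sketch; the variable names and the logit-scale handling are illustrative rather than taken from the official implementation:

import torch
import torch.nn.functional as F


def clip_contrastive_loss(
    image_emb: torch.Tensor,    # [N, d] image embeddings
    text_emb: torch.Tensor,     # [N, d] text embeddings
    logit_scale: torch.Tensor,  # learned scalar tau (inverse temperature)
) -> torch.Tensor:
    """Symmetric contrastive (InfoNCE-style) loss used by CLIP (sketch)."""
    # L2-normalize so dot products equal cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix S[i, j] = tau * cos(v_i, u_j)
    logits_per_image = logit_scale * image_emb @ text_emb.t()  # [N, N]
    logits_per_text = logits_per_image.t()

    # Correct pairings lie on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    loss_image = F.cross_entropy(logits_per_image, targets)
    loss_text = F.cross_entropy(logits_per_text, targets)
    return 0.5 * (loss_image + loss_text)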

28.2.3 Understanding the Contrastive Loss

The intuition behind the contrastive loss is worth developing carefully, because the same objective appears across multimodal AI — in CLAP for audio-text alignment (Chapter 29), in VideoCLIP for video-text alignment (Chapter 30), and in various retrieval systems.

Imagine a batch of $N = 4$ image-text pairs: (cat image, "a cat"), (dog image, "a dog"), (car image, "a car"), (tree image, "a tree"). The similarity matrix $\mathbf{S} \in \mathbb{R}^{4 \times 4}$ has the correct pairings along the diagonal. The loss pushes diagonal entries to be high and off-diagonal entries to be low.

Why the temperature matters: The softmax sharpness is controlled by a temperature $T$; the scaling factor in the similarity matrix above is its inverse, the logit scale $\tau = 1/T$. A small $T$ (large $\tau$) makes the softmax very peaked, strongly penalizing any mismatch, but can lead to training instability. A large $T$ (small $\tau$) produces softer distributions that are easier to optimize but provide a weaker learning signal. CLIP learns $\tau$ as a parameter, initialized at $1/0.07 \approx 14.3$ and clipped to a maximum of $1/0.01 = 100$; training typically drives it toward that cap, indicating that tight discrimination is optimal.

Why large batch sizes are critical: With $N$ image-text pairs per batch, each positive pair has $N - 1$ negative examples. With $N = 32{,}768$, each image must be distinguished from 32,767 incorrect captions, and each caption from 32,767 incorrect images. This enormous number of negatives provides a rich contrastive signal. With a small batch (e.g., $N = 256$), many negatives are too "easy" — semantically unrelated — providing little gradient. Large batches ensure the presence of "hard negatives" (e.g., another dog image for the text "a dog") that force the model to learn fine-grained distinctions.

Worked Example — Computing the loss: Consider a simplified batch of $N = 3$. After encoding and normalizing, suppose the similarity matrix (after temperature scaling) is:

$$\mathbf{S} = \begin{bmatrix} 5.0 & 1.2 & 0.3 \\ 0.8 & 4.5 & 0.5 \\ 0.1 & 0.7 & 4.8 \end{bmatrix}$$

The image-to-text loss for image 1 is: $-\log \frac{e^{5.0}}{e^{5.0} + e^{1.2} + e^{0.3}} = -\log \frac{148.4}{148.4 + 3.32 + 1.35} = -\log(0.970) = 0.031$. This is small because the model correctly assigns the highest similarity to the paired text. Computing similar terms for all rows and columns and averaging gives the total loss.
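
This arithmetic is easy to verify in code. The snippet below (assuming the matrix is already temperature-scaled) reproduces the 0.031 term for image 1 and then the full symmetric loss:

import torch
import torch.nn.functional as F

S = torch.tensor([[5.0, 1.2, 0.3],
                  [0.8, 4.5, 0.5],
                  [0.1, 0.7, 4.8]])
targets = torch.arange(3)

# Per-row term for image 1 (about 0.031)
print(F.cross_entropy(S[:1], targets[:1]).item())

# Full symmetric loss: average of image-to-text and text-to-image terms
loss = 0.5 * (F.cross_entropy(S, targets) + F.cross_entropy(S.t(), targets))
print(loss.item())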

28.2.4 Training Scale

CLIP was trained on 400 million image-text pairs collected from the internet (the WebImageText, or WIT, dataset). Training used:

  • Batch size of 32,768 (large batches provide more negative examples, which is critical for contrastive learning)
  • 32 epochs over the dataset
  • Distributed training across hundreds of GPUs
  • Mixed precision training with gradient checkpointing

The scale of training is essential: CLIP's zero-shot capabilities emerge from exposure to an enormous diversity of visual concepts paired with natural language descriptions.

OpenCLIP, the open-source reproduction of CLIP, was subsequently trained on the LAION-2B dataset (2 billion image-text pairs) and matched or exceeded the original CLIP's performance. The LAION-5B dataset, comprising 5.85 billion image-text pairs, has become the standard pre-training dataset for open-source multimodal models, though it has faced scrutiny for content quality and ethical concerns.

28.2.5 Zero-Shot Classification

CLIP enables zero-shot image classification without any task-specific training. The procedure is:

  1. Define class names as text prompts: "a photo of a {class}"
  2. Encode all prompts: $\mathbf{u}_c = f_{\text{text}}(\text{"a photo of a } c\text{"})$ for each class $c$
  3. Encode the image: $\mathbf{v} = f_{\text{image}}(\mathbf{x})$
  4. Predict the class with highest similarity: $\hat{c} = \arg\max_c \cos(\mathbf{v}, \mathbf{u}_c)$

Prompt engineering significantly affects zero-shot performance. Using a template such as "a photo of a {class}, a type of pet" instead of the bare class name, or ensembling multiple templates, improves accuracy by 3-5 percentage points.

CLIP ViT-L/14@336px achieves 76.2% zero-shot accuracy on ImageNet — competitive with a fully supervised ResNet-50 (76.1%) despite never seeing a single ImageNet training example.

28.2.6 CLIP Model Variants and Performance

CLIP was released in several configurations, each offering a different tradeoff between accuracy and computational cost:

| Model | Image Encoder | Embed Dim | ImageNet Zero-Shot | Params |
|---|---|---|---|---|
| RN50 | ResNet-50 | 1024 | 58.2% | 102M |
| ViT-B/32 | ViT-Base, 32px patches | 512 | 63.3% | 151M |
| ViT-B/16 | ViT-Base, 16px patches | 512 | 68.3% | 150M |
| ViT-L/14 | ViT-Large, 14px patches | 768 | 75.5% | 428M |
| ViT-L/14@336 | ViT-Large at 336px input | 768 | 76.2% | 428M |

The progression demonstrates two trends: (1) Vision Transformers consistently outperform ResNets of similar size in the CLIP framework, confirming the findings from Chapter 26 about ViT's scalability, and (2) smaller patch sizes and higher input resolutions significantly improve performance. The ViT-L/14@336px model remains the most commonly used CLIP variant for downstream applications due to its strong balance of quality and efficiency.

OpenCLIP subsequently trained larger models on LAION-2B, with OpenCLIP ViT-bigG/14 achieving 80.1% zero-shot accuracy on ImageNet — demonstrating that scaling both the model and the training data continues to improve performance.

28.2.7 CLIP's Representation Space

The shared embedding space learned by CLIP has remarkable properties:

  • Linear separability: CLIP features enable simple linear classifiers to achieve strong performance on many downstream tasks.
  • Compositional understanding: The model shows some sensitivity to composition, sometimes distinguishing "a dog on a cat" from "a cat on a dog", though this ability is far from reliable (see the limitations in Section 28.2.9).
  • Domain generalization: CLIP features transfer well to distribution shifts — it significantly outperforms ImageNet-trained models on ImageNet-V2, ImageNet-Sketch, and ImageNet-R.

28.2.8 Practical CLIP Usage with HuggingFace

CLIP is remarkably easy to use for zero-shot classification and retrieval:

import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image


def clip_zero_shot_classify(
    image_path: str,
    candidate_labels: list[str],
    model_name: str = "openai/clip-vit-base-patch32",
) -> dict[str, float]:
    """Perform zero-shot classification using CLIP.

    Args:
        image_path: Path to the input image.
        candidate_labels: List of candidate class names.
        model_name: HuggingFace CLIP model identifier.

    Returns:
        Dict mapping label names to probability scores.
    """
    processor = CLIPProcessor.from_pretrained(model_name)
    model = CLIPModel.from_pretrained(model_name)

    image = Image.open(image_path)
    prompts = [f"a photo of a {label}" for label in candidate_labels]

    inputs = processor(
        text=prompts,
        images=image,
        return_tensors="pt",
        padding=True,
    )

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits_per_image  # [1, num_labels]
    probs = logits.softmax(dim=-1).squeeze()

    results = {}
    for label, prob in zip(candidate_labels, probs):
        results[label] = prob.item()
    return results


def clip_image_text_similarity(
    image_path: str,
    texts: list[str],
    model_name: str = "openai/clip-vit-base-patch32",
) -> list[float]:
    """Compute CLIP similarity between an image and multiple texts.

    Args:
        image_path: Path to the input image.
        texts: List of text strings to compare.
        model_name: HuggingFace CLIP model identifier.

    Returns:
        List of cosine similarity scores.
    """
    processor = CLIPProcessor.from_pretrained(model_name)
    model = CLIPModel.from_pretrained(model_name)

    image = Image.open(image_path)
    inputs = processor(
        text=texts, images=image,
        return_tensors="pt", padding=True,
    )

    with torch.no_grad():
        image_features = model.get_image_features(
            pixel_values=inputs["pixel_values"]
        )
        text_features = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )

    # Normalize features
    image_features = image_features / image_features.norm(
        dim=-1, keepdim=True
    )
    text_features = text_features / text_features.norm(
        dim=-1, keepdim=True
    )

    # Cosine similarity
    similarities = (image_features @ text_features.T).squeeze()
    return similarities.tolist()

Prompt engineering for CLIP: The choice of text template significantly affects zero-shot performance. OpenAI found that using ensembles of 80 different prompt templates (e.g., "a photo of a {class}", "a blurry photo of a {class}", "a black and white photo of a {class}", "a photo of the large {class}") and averaging the resulting text embeddings improved ImageNet zero-shot accuracy by about 3.5 percentage points. Domain-specific prompts help further — for satellite imagery, "a satellite photo of {class}" works better than the generic template.
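
The sketch below illustrates prompt ensembling with the HuggingFace CLIP classes used earlier; the four templates are a small illustrative subset rather than OpenAI's full 80-template list:

import torch
from transformers import CLIPModel, CLIPProcessor

TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a black and white photo of a {}.",
    "a photo of the large {}.",
]


@torch.no_grad()
def ensembled_class_embeddings(
    class_names: list[str],
    model: CLIPModel,
    processor: CLIPProcessor,
) -> torch.Tensor:
    """Return one averaged, L2-normalized text embedding per class."""
    class_embs = []
    for name in class_names:
        prompts = [t.format(name) for t in TEMPLATES]
        inputs = processor(text=prompts, return_tensors="pt", padding=True)
        feats = model.get_text_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        mean = feats.mean(dim=0)               # average over templates
        class_embs.append(mean / mean.norm())  # re-normalize the mean
    return torch.stack(class_embs)  # [num_classes, d]

Zero-shot classification then proceeds exactly as before, with the image embedding compared against these averaged class embeddings instead of single-prompt embeddings.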

28.2.9 Limitations of CLIP

Despite its strengths, CLIP has notable limitations:

  • Counting and spatial reasoning: CLIP struggles with counting objects ("three cats") and understanding spatial relationships ("the cup is to the left of the plate"). This is because the contrastive objective treats the text as a bag of concepts rather than modeling compositional structure.
  • Fine-grained recognition: Distinguishing between similar subspecies or specialized domain objects remains challenging without domain-specific training data.
  • Systematic compositionality: The Winoground benchmark reveals that CLIP often fails at compositional understanding — for example, it cannot reliably distinguish "a cat on a mat" from "a mat on a cat."
  • Bias: CLIP inherits biases from its web-crawled training data, including demographic biases in zero-shot classification. Studies have shown that CLIP associates certain occupations with specific genders or ethnicities, reflecting stereotypes in the training data.
  • Typographic attacks: CLIP can be fooled by placing text on objects — an apple with "iPod" written on it may be classified as an iPod, because the text encoder's understanding dominates.

28.3 Open-Vocabulary Detection and Segmentation

CLIP's aligned vision-language representations enable a new class of open-vocabulary models that can detect or segment objects described by arbitrary text queries.

28.3.1 OWL-ViT: Open-World Object Detection

OWL-ViT (Minderer et al., 2022) adapts a ViT-based CLIP model for object detection. The key idea is to use the patch tokens (not just the CLS token) as region proposals and match them against text queries using the CLIP similarity:

  1. Remove CLIP's final pooling layer to retain per-patch features.
  2. Add lightweight detection heads (box regression + objectness) on top of patch features.
  3. Classify each detection by computing similarity to text query embeddings.

This enables detecting objects from novel categories simply by providing their text descriptions.
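
As a concrete illustration, the HuggingFace transformers library exposes OWL-ViT for zero-shot detection. The sketch below follows the library's documented usage pattern, but the model identifier, thresholds, and post-processing call are assumptions that may vary across library versions:

import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection


def detect_open_vocabulary(
    image_path: str,
    queries: list[str],
    threshold: float = 0.1,
    model_name: str = "google/owlvit-base-patch32",
) -> list[dict]:
    """Detect objects matching free-form text queries with OWL-ViT (sketch)."""
    processor = OwlViTProcessor.from_pretrained(model_name)
    model = OwlViTForObjectDetection.from_pretrained(model_name)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[queries], images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Convert patch-level predictions into (box, score, label) triples
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs=outputs, threshold=threshold, target_sizes=target_sizes
    )[0]

    detections = []
    for box, score, label in zip(
        results["boxes"], results["scores"], results["labels"]
    ):
        detections.append({
            "query": queries[int(label)],
            "score": float(score),
            "box": [round(float(x), 1) for x in box],  # [x0, y0, x1, y1]
        })
    return detections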

28.3.2 Grounding DINO and Grounded SAM

Grounding DINO combines the DINO object detector (not to be confused with the self-supervised learning method DINO from Chapter 26) with CLIP-like text grounding to achieve open-set object detection. The architecture processes text and image features through cross-modality fusion layers, enabling detection of any object that can be described in natural language.

Combined with SAM (Segment Anything Model, discussed in Section 26.7.4), it creates Grounded SAM — a pipeline that can segment any object described in text:

  1. Grounding DINO receives the text query ("the black cat on the left") and the image, producing bounding boxes for the matching objects.
  2. SAM receives the bounding boxes as prompts and produces precise segmentation masks.

This two-stage pipeline is remarkably powerful for zero-shot segmentation: it requires no training data for novel categories and can handle complex referring expressions. The practical applications span automated image editing, visual data annotation, and robotic manipulation where language-guided object identification is needed.

28.3.3 The Impact of Open-Vocabulary Models

Open-vocabulary detection and segmentation represent a paradigm shift from the traditional "closed-set" approach where models can only detect pre-defined categories. With CLIP-based open-vocabulary models, the set of detectable objects is limited only by what can be described in language. This has profound implications:

  • No re-training needed: Adding a new category requires only writing its text description, not collecting and labeling training data.
  • Long-tail detection: Rare objects that are underrepresented in training data can still be detected if they can be described.
  • Cross-domain transfer: A model trained on natural images can detect objects in medical, satellite, or industrial imagery if appropriate text descriptions are provided.

28.4 Image Captioning

Image captioning generates a natural language description of an image. It requires the model to recognize objects, understand their relationships, and express this understanding in grammatically correct, natural language.

28.4.1 Encoder-Decoder Architecture

Image captioning has evolved from template-based methods ("There is a [object] in the [scene]") through retrieval-based approaches (finding the most similar caption from a database) to the powerful neural encoder-decoder architectures used today. The modern approach treats captioning as a conditional language generation problem: given an image, generate a sequence of words that describes it.

The classic neural approach uses a vision encoder to extract image features and a language decoder to generate text:

$$p(\mathbf{y} | \mathbf{x}) = \prod_{t=1}^{T} p(y_t | y_1, \ldots, y_{t-1}, \mathbf{x})$$

Encoder: A pre-trained CNN or ViT produces a set of spatial features $\mathbf{H} = \{h_1, h_2, \ldots, h_K\}$ where $K$ is the number of spatial regions or patches.

Decoder: A Transformer decoder generates tokens autoregressively, using cross-attention to attend to the image features at each generation step.

28.4.2 BLIP and BLIP-2

BLIP (Li et al., 2022) introduced a unified vision-language pre-training framework with three objectives:

  1. Image-text contrastive loss (CLIP-like alignment)
  2. Image-text matching loss (binary classification of whether image and text match)
  3. Image-grounded text generation loss (captioning objective)

BLIP-2 (Li et al., 2023) introduced the Q-Former (Querying Transformer), a lightweight module that bridges a frozen image encoder and a frozen language model:

  1. Frozen image encoder (e.g., ViT-G from EVA-CLIP): Extracts visual features.
  2. Q-Former: A small transformer with a set of learnable query tokens that cross-attend to the image features, producing a fixed set of visual tokens.
  3. Frozen LLM (e.g., OPT or FlanT5): Receives the Q-Former's output as a prefix and generates the caption.

This design is extremely parameter-efficient: only the Q-Former (188M parameters) is trained, while the much larger image encoder and LLM remain frozen.

28.4.3 The Q-Former in Detail

The Q-Former is BLIP-2's most important architectural innovation and deserves a closer look. It consists of two transformer submodules that share self-attention layers:

  1. Image transformer: A set of $K = 32$ learnable query tokens (each of dimension $d = 768$) that interact with frozen image features through cross-attention. These queries learn to extract the most relevant visual information.

  2. Text transformer: Functions as both a text encoder and a text decoder, sharing parameters with the image transformer's self-attention layers.

The intuition is that the learnable queries act as an "information bottleneck." The frozen image encoder (ViT-G, with 1 billion parameters) produces hundreds of visual tokens, but the Q-Former distills this into just 32 tokens that capture the most task-relevant information. These 32 tokens then serve as a visual prefix for the frozen LLM.
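
A stripped-down sketch of this query bottleneck, ignoring the shared text transformer and the full BLIP-2 training objectives, looks roughly as follows; the dimensions, layer choices, and names are illustrative only:

import torch
import torch.nn as nn


class QueryBottleneck(nn.Module):
    """Minimal Q-Former-style bottleneck: K learnable queries cross-attend
    to frozen image features and return K distilled visual tokens."""

    def __init__(self, num_queries: int = 32, dim: int = 768,
                 image_dim: int = 1408, num_heads: int = 12) -> None:
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.image_proj = nn.Linear(image_dim, dim)  # map ViT features to query dim
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: [B, num_patches, image_dim] from a frozen image encoder
        kv = self.image_proj(image_feats)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = q + self.self_attn(q, q, q)[0]    # queries interact with each other
        q = q + self.cross_attn(q, kv, kv)[0]  # extract visual information
        return q + self.ffn(q)                 # [B, K, dim] visual tokens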

The Q-Former is pre-trained in two stages:

Stage 1 — Vision-language representation learning (with frozen image encoder):

  • Image-text contrastive learning (aligns query outputs with text)
  • Image-grounded text generation (generates text conditioned on queries)
  • Image-text matching (binary classification: does this text match this image?)

Stage 2 — Vision-language generative learning (with frozen image encoder + frozen LLM):

  • The Q-Former's visual tokens are projected to the LLM's input dimension via a linear layer
  • The LLM generates the caption given the visual prefix

This two-stage design means that only 188M parameters (the Q-Former and projection layer) are ever trained, while the 1B+ image encoder and the multi-billion parameter LLM remain frozen. The result is a highly parameter-efficient architecture that achieves state-of-the-art performance.

28.4.4 Evaluation Metrics for Captioning

Image captioning evaluation is notoriously challenging because there are many valid ways to describe an image. Standard metrics include:

  • BLEU (Bilingual Evaluation Understudy): Measures n-gram precision between generated and reference captions. BLEU-4 is standard. A BLEU-4 score of 40+ is considered strong on COCO Captions.
  • METEOR: Incorporates synonyms and stemming for more flexible matching. Correlates better with human judgments than BLEU.
  • CIDEr: Designed specifically for captioning; measures TF-IDF weighted n-gram similarity, emphasizing informative words. CIDEr scores of 130+ indicate strong performance on COCO.
  • SPICE: Evaluates semantic propositional content by parsing captions into scene graphs and comparing their tuples. This measures whether the caption captures the right objects, attributes, and relationships, regardless of phrasing.
  • CLIPScore: Uses CLIP to measure alignment between the image and the generated caption, not requiring reference captions. This is the only reference-free metric listed here and is increasingly popular because it evaluates whether the caption is faithful to the image rather than whether it matches human-written references.

Worked Example: Given the reference "A black cat sitting on a red couch" and a generated caption "A dark cat rests on a crimson sofa":

  • BLEU-4 would score low (few exact 4-gram matches)
  • METEOR would score higher (recognizes "dark"/"black" and "rests"/"sitting" as related)
  • CIDEr would reward the informative words "cat" and "couch/sofa"
  • SPICE would score high (correct object: cat, attribute: dark/black, relation: on, object: couch/sofa)
  • CLIPScore would compare the caption embedding directly with the image embedding

In practice, researchers report multiple metrics and increasingly rely on CLIPScore and human evaluation for final comparisons.
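
Of these, CLIPScore is the easiest to compute from scratch. A minimal sketch, using the commonly cited definition $2.5 \cdot \max(\cos(\mathbf{v}, \mathbf{u}), 0)$ and an assumed backbone choice, is:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


@torch.no_grad()
def clip_score(image_path: str, caption: str,
               model_name: str = "openai/clip-vit-base-patch32") -> float:
    """Reference-free CLIPScore: 2.5 * max(cos(image, caption), 0) (sketch)."""
    processor = CLIPProcessor.from_pretrained(model_name)
    model = CLIPModel.from_pretrained(model_name)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)

    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)

    cos = (img * txt).sum().item()  # cosine similarity of the pair
    return 2.5 * max(cos, 0.0)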

28.4.5 Practical Image Captioning with BLIP-2

import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image


def generate_caption(
    image_path: str,
    prompt: str = "",
    model_name: str = "Salesforce/blip2-opt-2.7b",
    max_new_tokens: int = 50,
) -> str:
    """Generate a caption for an image using BLIP-2.

    Args:
        image_path: Path to the input image.
        prompt: Optional text prompt to guide captioning.
        model_name: HuggingFace model identifier.
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        Generated caption string.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = Blip2Processor.from_pretrained(model_name)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_name, torch_dtype=dtype,
    ).to(device)

    image = Image.open(image_path).convert("RGB")

    if prompt:
        inputs = processor(image, text=prompt, return_tensors="pt")
    else:
        inputs = processor(image, return_tensors="pt")

    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens
        )

    caption = processor.batch_decode(
        generated_ids, skip_special_tokens=True
    )[0].strip()
    return caption


# Usage:
# caption = generate_caption("photo.jpg")
# print(f"Caption: {caption}")
#
# # Visual QA style:
# answer = generate_caption(
#     "photo.jpg", prompt="Question: What color is the car? Answer:"
# )
# print(f"Answer: {answer}")

28.5 Visual Question Answering (VQA)

VQA requires a model to answer natural language questions about an image. It tests a deeper level of visual understanding than captioning, as questions can probe specific attributes, relationships, counting, and reasoning.

28.5.1 Problem Formulation

Given an image $\mathbf{x}$ and a question $\mathbf{q}$, the model must produce an answer $\mathbf{a}$. Approaches differ in whether the answer is:

  • Classification: Select from a fixed set of common answers (3,129 answers in VQAv2). This is simpler but limited.
  • Generative: Generate the answer as free-form text. More flexible and increasingly preferred.

28.5.2 Classical VQA Architectures

Early VQA models used multimodal fusion:

  1. Image features: Extract from a pre-trained CNN (typically Faster R-CNN region features, producing 36-100 region proposals per image, each with a 2048-dimensional feature vector).
  2. Question features: Encode with an LSTM or BERT to produce a question embedding.
  3. Fusion: Combine using bilinear attention, Tucker decomposition, or cross-attention. The fusion module must learn which image regions are relevant to the question.
  4. Classification: Predict over the answer vocabulary (top 3,129 answers for VQAv2).

Notable models include MCAN (Multi-modal Co-Attention Network) and LXMERT (Learning Cross-Modality Encoder Representations from Transformers), which used separate transformer encoders for each modality with cross-attention layers for fusion.

The fusion challenge: The core difficulty in VQA is learning to attend to the right image regions given the question. For "What color is the car?", the model must attend to the car region and extract its color attribute. For "How many people are in the image?", the model must attend to all person instances and count them. Different question types require fundamentally different attention patterns, making the fusion module's job extremely challenging.

28.5.3 Modern VQA with Vision-Language Models

Modern approaches frame VQA as text generation conditioned on the image:

  1. Encode the image into visual tokens.
  2. Concatenate visual tokens with the tokenized question.
  3. Generate the answer autoregressively.

Models like BLIP-2 and PaLI achieve state-of-the-art VQA performance by leveraging the generative capabilities of large language models. The shift from classification-based to generation-based VQA is significant: classification VQA is limited to a fixed answer vocabulary, while generation-based VQA can produce arbitrary text answers, handle open-ended questions, and provide explanations or reasoning chains.

Example: For the question "What is the person in the red shirt doing?", a classification VQA model must select from pre-defined answers like "running," "standing," or "eating." A generative VQA model can produce "The person in the red shirt is jogging along the beach while listening to music through earbuds." This richer output enables more natural and informative visual question answering.

28.5.4 VQA Benchmarks

  • VQAv2: 1.1M questions on 200K images from COCO. Balanced to reduce language bias.
  • OK-VQA: Requires outside knowledge beyond what is visible in the image.
  • TextVQA: Requires reading and understanding text within images.
  • GQA: Compositional questions generated from scene graphs, enabling systematic evaluation of reasoning capabilities. GQA was designed to test whether models truly reason about visual content or simply exploit statistical biases in the question distribution.
  • VizWiz: Questions from visually impaired users photographing real-world scenes — often blurry, poorly framed, and genuinely challenging. VizWiz is uniquely important because it represents real-world accessibility needs rather than academic curiosity.

The language bias problem in VQA: A significant challenge in VQA evaluation is that models can achieve surprisingly high accuracy by exploiting language biases without truly understanding the image. For example, if 70% of "How many..." questions in the dataset have the answer "2," a model can achieve 70% accuracy on counting questions by always predicting "2." The VQAv2 dataset was specifically constructed to balance such biases by ensuring that for each question, there are two similar images with different correct answers.


28.6 LLaVA: Large Language and Vision Assistant

LLaVA (Liu et al., 2023) demonstrated that connecting a vision encoder to a large language model with a simple projection layer can create a powerful multimodal assistant.

28.6.1 Architecture

LLaVA's architecture is elegantly simple — almost deceptively so. While earlier multimodal architectures like Flamingo required carefully designed gated cross-attention layers and Perceiver Resamplers, LLaVA demonstrated that a straightforward approach could achieve competitive or superior performance. This simplicity has made LLaVA one of the most reproduced and extended architectures in multimodal AI.

The design philosophy follows a principle that has proven powerful throughout deep learning: rather than designing complex interaction mechanisms, provide the powerful pre-trained components with a simple interface and let the training data teach the model how to use them effectively.

The three components are:

  1. Vision encoder: CLIP ViT-L/14 @ 336px, producing a grid of patch features $\mathbf{Z}_v \in \mathbb{R}^{N \times d_v}$ where $N = 576$ (24x24 grid) and $d_v = 1024$.

  2. Projection layer: A linear projection (or 2-layer MLP in LLaVA-1.5) that maps visual features to the LLM's embedding space: $\mathbf{H}_v = \mathbf{W}\mathbf{Z}_v$ where $\mathbf{W} \in \mathbb{R}^{d_v \times d_{\text{LLM}}}$.

  3. Large Language Model: Vicuna (LLaMA fine-tuned for instruction following), available in 7B and 13B variants.

The visual tokens are simply prepended to the text tokens and processed by the LLM as if they were regular text tokens. The LLM's self-attention naturally learns to attend to relevant visual features when generating responses.
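
The entire vision-to-language interface can be sketched in a few lines. The dimensions below match the description above (576 CLIP patch features of width 1024, a 4096-dimensional LLM embedding space for the 7B variant); the module and function names are illustrative:

import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """LLaVA-1.5-style connector: a 2-layer MLP mapping CLIP patch
    features into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096) -> None:
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: [B, 576, 1024] from frozen CLIP ViT-L/14 @ 336px
        return self.proj(patch_feats)  # [B, 576, llm_dim]


def build_multimodal_input(
    visual_tokens: torch.Tensor,    # [B, 576, llm_dim]
    text_embeddings: torch.Tensor,  # [B, T, llm_dim] embedded prompt tokens
) -> torch.Tensor:
    """Prepend visual tokens to text embeddings; the LLM then processes
    the concatenated sequence with ordinary causal self-attention."""
    return torch.cat([visual_tokens, text_embeddings], dim=1)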

28.6.2 Training Recipe

LLaVA uses a two-stage training process:

Stage 1: Feature Alignment Pre-training

  • Data: 595K image-text pairs from CC3M (filtered)
  • Only the projection layer $\mathbf{W}$ is trained; both the vision encoder and the LLM are frozen
  • Objective: image captioning (generate the caption given the image)
  • This stage teaches the projection layer to translate visual features into the LLM's "language"

Stage 2: Visual Instruction Tuning

  • Data: 158K multimodal instruction-following examples generated using GPT-4
  • The projection layer and the LLM are trained; the vision encoder remains frozen
  • Data includes conversations, detailed descriptions, and complex reasoning questions
  • This stage teaches the model to follow multimodal instructions

28.6.3 LLaVA-1.5 Improvements

LLaVA-1.5 introduced several improvements:

  • MLP projection: A 2-layer MLP (with GELU activation) instead of a linear projection, improving feature transformation.
  • Higher resolution: 336px input to the vision encoder for finer visual detail.
  • More data: Academic VQA datasets mixed with instruction data for broader capability.
  • Stronger LLM: Vicuna-13B for improved reasoning.

28.6.4 Why LLaVA Works: Design Principles

LLaVA's success despite its architectural simplicity reveals important design principles for multimodal AI:

  1. The power of pre-trained components: Both the CLIP vision encoder and the Vicuna LLM are already highly capable within their respective domains. The projection layer's only job is to "translate" between their representation spaces — a much simpler task than learning visual understanding from scratch.

  2. Visual instruction tuning is the secret sauce: The GPT-4-generated instruction data is diverse and challenging, covering conversations, detailed descriptions, and complex reasoning. This data teaches the model not just what to see but how to reason about what it sees in a conversational context.

  3. Frozen vision encoder preserves visual quality: By keeping the CLIP encoder frozen throughout training, LLaVA preserves the rich visual representations learned from 400M image-text pairs. Fine-tuning the vision encoder on only 158K examples would risk catastrophic forgetting.

  4. The LLM handles the heavy lifting: Most of the reasoning, language generation, and instruction following comes from the LLM's pre-existing capabilities. The visual tokens simply provide additional context for the LLM to reason about, leveraging the same in-context learning abilities that make LLMs powerful for text-only tasks.

28.6.5 Conversation Capabilities

LLaVA can engage in multi-turn conversations about images:

User: [image of a kitchen] What is unusual about this image?
LLaVA: The unusual thing about this image is that there is a cat sitting on top
       of the kitchen counter next to a microwave...
User: What risks does this pose?
LLaVA: Having a cat on the kitchen counter poses several risks: 1) hygiene
       concerns as cat fur and paws can contaminate food preparation surfaces...

This conversational ability emerges from the visual instruction tuning stage.
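
In practice, LLaVA-1.5 checkpoints can be queried through HuggingFace transformers. The sketch below follows the conventions of the community llava-hf checkpoints; the model identifier and the "USER: <image> ... ASSISTANT:" prompt template are assumptions that may differ across releases:

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


def llava_chat(
    image_path: str,
    question: str,
    model_name: str = "llava-hf/llava-1.5-7b-hf",
    max_new_tokens: int = 200,
) -> str:
    """Single-turn visual conversation with a LLaVA-1.5 checkpoint (sketch)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = AutoProcessor.from_pretrained(model_name)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_name, torch_dtype=dtype
    ).to(device)

    image = Image.open(image_path).convert("RGB")
    # The <image> placeholder marks where the visual tokens are inserted.
    prompt = f"USER: <image>\n{question} ASSISTANT:"

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    text = processor.decode(output_ids[0], skip_special_tokens=True)
    # Keep only the assistant's reply
    return text.split("ASSISTANT:")[-1].strip()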

28.6.6 LLaVA Variants and Successors

The LLaVA family has expanded rapidly:

  • LLaVA-1.5: Replaced linear projection with a 2-layer MLP, added academic VQA data, and used Vicuna-13B. Achieved state-of-the-art on 11 benchmarks despite its simplicity.
  • LLaVA-NeXT (LLaVA-1.6): Added support for variable-resolution inputs through "AnyRes" — the image is divided into tiles, each processed independently by the vision encoder, and the resulting tokens are concatenated. This enables understanding fine details in high-resolution images.
  • LLaVA-OneVision: Unified image and video understanding within a single model, extending to multi-image reasoning and video QA.

The LLaVA family's influence extends beyond the models themselves. The visual instruction tuning paradigm — generating multimodal training data using a powerful model like GPT-4 — has become a standard approach for training multimodal assistants. This data generation strategy is a practical example of "distillation at the data level," where a more capable model's reasoning is captured in a dataset that trains a more efficient model.


28.7 Flamingo and In-Context Multimodal Learning

28.7.1 Architecture

Flamingo (Alayrac et al., 2022) from DeepMind is a family of visual language models designed for few-shot learning. Its key architectural innovation is the Perceiver Resampler and gated cross-attention:

  1. Vision encoder: NFNet (pre-trained, frozen) processes images (or video frames) into spatial features.
  2. Perceiver Resampler: Learns a fixed set of visual tokens (64) from the variable-length vision features through cross-attention with learnable latent queries. This reduces the computational cost of visual conditioning.
  3. Language model: Chinchilla-family models (1.4B, 7B, or 70B parameters, frozen) with injected gated cross-attention layers.
  4. Gated cross-attention: New cross-attention layers inserted between frozen LLM layers, where text tokens attend to visual tokens. The output is gated by a learnable scalar initialized to zero, preserving the LLM's original behavior at initialization.

28.7.2 Few-Shot Multimodal Learning

Flamingo's defining capability is few-shot learning across modalities. By interleaving images and text in its context window, Flamingo can learn new tasks from just a few examples:

Input:  [image1] This is a cat. [image2] This is a dog. [image3] This is a
Output: hamster.

With just 4-32 examples, Flamingo matches or exceeds fine-tuned models on many VQA and captioning benchmarks, demonstrating that scaling multimodal models enables emergent in-context learning abilities.

28.7.3 The Gated Cross-Attention Mechanism

The gated cross-attention in Flamingo deserves a closer look, as it solves a fundamental challenge: how to inject visual information into a frozen LLM without disrupting its pre-trained language capabilities.

Between each pair of frozen LLM layers, Flamingo inserts a new cross-attention layer. The computation is:

$$\mathbf{y} = \mathbf{x} + \tanh(\alpha) \cdot \text{CrossAttn}(\mathbf{x}, \mathbf{v})$$

where:

  • $\mathbf{x}$ is the text hidden state from the previous frozen LLM layer
  • $\mathbf{v}$ is the set of visual tokens from the Perceiver Resampler
  • $\alpha$ is a learnable scalar initialized to zero

The $\tanh(\alpha)$ gate starts at $\tanh(0) = 0$, meaning the cross-attention initially contributes nothing — the model behaves exactly like the original frozen LLM. As training progresses, $\alpha$ grows, gradually allowing visual information to flow into the language model. This "zero-initialization" trick is a general design pattern for adding new capabilities to pre-trained models without disrupting existing ones, also used in ControlNet (as we saw in Section 27.10) and adapter tuning.
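
A minimal PyTorch sketch of such a block is shown below; layer norms and attention masking are omitted, Flamingo's gated feed-forward sublayer is included in simplified form, and all names and dimensions are illustrative:

import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style gated cross-attention block (minimal sketch).

    Inserted between frozen LM layers; the tanh gates start at zero, so the
    block is initially a no-op and the LM's behavior is preserved."""

    def __init__(self, dim: int, num_heads: int = 8) -> None:
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))  # alpha for attention
        self.ffn_gate = nn.Parameter(torch.zeros(1))   # alpha for feed-forward

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: [B, T, dim] text hidden states; visual: [B, K, dim] Resampler tokens
        attn_out, _ = self.cross_attn(query=x, key=visual, value=visual)
        x = x + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x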

28.7.4 Interleaved Image-Text Understanding

Unlike models that process a single image, Flamingo handles arbitrarily interleaved sequences of images and text. This enables:

  • Multi-image reasoning ("Which of these two dishes looks healthier?")
  • Visual dialogue (conversational VQA with context)
  • Document understanding (reasoning about pages with figures and text)

28.7.5 Open-Source Flamingo Alternatives

The original Flamingo model was not released publicly, but open-source alternatives have been developed:

  • OpenFlamingo: An open-source reproduction using LLaMA and CLIP ViT-L/14, achieving comparable few-shot performance.
  • IDEFICS/IDEFICS2: HuggingFace's open-source multimodal models inspired by Flamingo, supporting interleaved image-text inputs. IDEFICS builds on LLaMA and OpenCLIP; IDEFICS2 builds on Mistral-7B and SigLIP.

These models demonstrate that the Flamingo architecture can be reproduced with open-source components, democratizing access to few-shot multimodal learning.


28.8 Multimodal Embeddings

28.8.1 Shared Embedding Spaces

CLIP demonstrated that contrastive learning can create a shared embedding space where images and text are directly comparable. The power of this idea cannot be overstated: once two modalities share an embedding space, any operation defined on one modality (classification, retrieval, clustering, similarity search) automatically works for cross-modal operations. An image query finds relevant text; a text query finds relevant images; and similarity between any pair of items — regardless of modality — has a well-defined, meaningful value.

This concept has been extended to create universal multimodal embeddings that go beyond just images and text. The goal is a single embedding space where all modalities coexist, enabling cross-modal transfer that was never explicitly trained.

28.8.2 ImageBind

ImageBind (Girdhar et al., 2023) extends the CLIP approach to six modalities: images, text, audio, depth, thermal, and IMU (motion sensor) data. The key insight is that image-paired data exists for all these modalities (images with captions, images with corresponding audio, images with depth maps, etc.), so CLIP-style contrastive training between images and each other modality naturally aligns all modalities in a shared space — even modalities that were never directly paired during training. For example, although audio and text were never directly paired, both are aligned to images, so audio-text retrieval emerges as a zero-shot capability. This "binding" through a common modality is an elegant example of transitive alignment.

28.8.3 Embedding Arithmetic

Multimodal embedding spaces support arithmetic operations:

  • Cross-modal retrieval: Find images matching a text query, or text matching an image query.
  • Composition: Adding the embedding of "sunset" to an image embedding produces a query that retrieves sunset versions of similar scenes.
  • Analogy: Image A - "winter" + "summer" can retrieve a summer version of a winter scene.

Efficient similarity search in embedding spaces uses:

  • FAISS (Facebook AI Similarity Search): Supports exact and approximate nearest neighbor search with various index types (flat, IVF, HNSW, PQ). For CLIP embeddings (512-dimensional), a flat index with cosine similarity can search 1 million images in about 10 ms on CPU.
  • Annoy (Approximate Nearest Neighbors Oh Yeah): Tree-based approximate search. Lower memory footprint than FAISS but less accurate for high-dimensional embeddings.
  • Vector databases: Pinecone, Weaviate, Qdrant, and Milvus provide production-ready vector search with filtering and metadata support. They handle the full lifecycle of embedding management: insertion, deletion, updating, and search with metadata filters.

The choice of search index type depends on the scale of the dataset:

| Dataset size | Recommended index | Search time | Memory per embedding |
|---|---|---|---|
| < 100K | FAISS Flat (exact) | < 10 ms | 2 KB (512-dim FP32) |
| 100K - 10M | FAISS IVF-Flat | < 50 ms | 2 KB |
| 10M - 1B | FAISS IVF-PQ | < 100 ms | ~64 bytes |
| > 1B | Distributed vector DB | < 200 ms | ~64 bytes |

Product quantization (PQ) reduces memory by compressing each 512-dimensional vector to about 64 bytes — a 32x reduction — at the cost of slightly lower recall. For most applications, this tradeoff is worthwhile at scale.
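
As a concrete example, the snippet below builds an exact inner-product FAISS index over L2-normalized CLIP embeddings (so inner product equals cosine similarity). It assumes faiss is installed (faiss-cpu or faiss-gpu), and the .npy file names are placeholders for embeddings produced elsewhere, e.g. by the indexing class in the next subsection:

import faiss
import numpy as np

# embeddings: [num_images, 512] float32, L2-normalized CLIP image features
embeddings = np.load("clip_image_embeddings.npy").astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)

# query: [1, 512] L2-normalized CLIP text embedding
query = np.load("query_embedding.npy").astype("float32")
scores, indices = index.search(query, 5)  # top-5 cosine similarities

for rank, (idx, score) in enumerate(zip(indices[0], scores[0]), start=1):
    print(f"{rank}. image #{idx} (cosine similarity {score:.3f})")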

28.8.4 Building CLIP Embeddings for a Dataset

Here is a practical example of encoding a dataset of images into CLIP embeddings for later retrieval:

import torch
import numpy as np
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
from pathlib import Path


class CLIPEmbeddingIndex:
    """Build and query a CLIP embedding index for image retrieval.

    Args:
        model_name: HuggingFace CLIP model identifier.
        device: Device to run the model on.
    """

    def __init__(
        self,
        model_name: str = "openai/clip-vit-base-patch32",
        device: str = "cpu",
    ) -> None:
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.model = CLIPModel.from_pretrained(model_name).to(device)
        self.device = device
        self.embeddings: np.ndarray | None = None
        self.image_paths: list[str] = []

    @torch.no_grad()
    def index_images(
        self,
        image_dir: str,
        batch_size: int = 32,
    ) -> None:
        """Encode all images in a directory.

        Args:
            image_dir: Path to directory containing images.
            batch_size: Number of images to process at once.
        """
        paths = sorted(Path(image_dir).glob("*.jpg"))
        self.image_paths = [str(p) for p in paths]
        all_embeddings = []

        for i in range(0, len(paths), batch_size):
            batch_paths = paths[i : i + batch_size]
            images = [Image.open(p).convert("RGB") for p in batch_paths]
            inputs = self.processor(
                images=images, return_tensors="pt", padding=True,
            ).to(self.device)

            features = self.model.get_image_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)
            all_embeddings.append(features.cpu().numpy())

        self.embeddings = np.concatenate(all_embeddings, axis=0)

    @torch.no_grad()
    def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Search for images matching a text query.

        Args:
            query: Text search query.
            top_k: Number of results to return.

        Returns:
            List of (image_path, similarity_score) tuples.
        """
        inputs = self.processor(
            text=[query], return_tensors="pt", padding=True,
        ).to(self.device)

        text_features = self.model.get_text_features(**inputs)
        text_features = text_features / text_features.norm(
            dim=-1, keepdim=True
        )
        text_np = text_features.cpu().numpy()

        similarities = (self.embeddings @ text_np.T).squeeze()
        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = []
        for idx in top_indices:
            results.append(
                (self.image_paths[idx], float(similarities[idx]))
            )
        return results

28.9 Building a Multimodal Retrieval System

A practical multimodal retrieval system allows users to search a database of images using natural language queries (text-to-image retrieval) or find similar images given a query image (image-to-image retrieval).

28.9.1 System Architecture

  1. Indexing pipeline:
     • Load images from the dataset
     • Encode each image using CLIP's image encoder
     • Normalize embeddings to unit length
     • Store them in a vector index (FAISS)

  2. Query pipeline:
     • Encode the text query using CLIP's text encoder (or encode a query image using the image encoder)
     • Normalize the query embedding
     • Search the index for nearest neighbors
     • Return the top-K results with similarity scores

28.9.2 Hybrid Retrieval

Combining text-based search with visual retrieval produces superior results:

  • Text query + metadata filter: "sunset photos" filtered by location = "California"
  • Image query + text refinement: "images similar to this one but in winter"
  • Re-ranking: Use a cross-encoder (which jointly processes query and candidate) to re-rank the top-K results from the fast retrieval stage.

28.9.3 Scaling Considerations

For production systems with millions or billions of images:

  • Use approximate nearest neighbor search (IVF-PQ in FAISS) to reduce search time from roughly $O(N)$ to roughly $O(\sqrt{N})$.
  • Quantize embeddings from FP32 to INT8 to reduce memory by 4x.
  • Shard the index across multiple machines for horizontal scaling.
  • Use GPU-accelerated search for throughput-critical applications.

28.9.4 Practical Considerations for Retrieval

Re-ranking with cross-encoders: The dual-encoder architecture (separate image and text encoders) enables fast retrieval because embeddings can be pre-computed and cached. However, a more accurate but slower approach uses a cross-encoder that jointly processes the query and each candidate. In practice, a two-stage pipeline works best:

  1. Stage 1 (fast retrieval): Use CLIP dual-encoder to retrieve top-100 candidates in milliseconds.
  2. Stage 2 (re-ranking): Use a cross-encoder (e.g., BLIP-2's image-text matching head) to re-score the top-100 candidates. This is 100x slower per item but much more accurate for the final ranking.

This two-stage approach is analogous to the retrieval-then-rerank pattern used in text search systems and RAG pipelines.
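
The pattern can be sketched as follows; clip_index stands in for the CLIPEmbeddingIndex defined earlier, and cross_encoder_score is a hypothetical callable wrapping whatever image-text matching model is used for re-ranking:

from PIL import Image


def retrieve_then_rerank(
    query: str,
    clip_index,            # e.g., the CLIPEmbeddingIndex from Section 28.8
    cross_encoder_score,   # hypothetical: (PIL.Image, str) -> float
    candidates: int = 100,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Two-stage retrieval: fast dual-encoder recall, slow cross-encoder ranking."""
    # Stage 1: cheap and approximate -- dual-encoder similarity over the index
    stage1 = clip_index.search(query, top_k=candidates)

    # Stage 2: expensive and accurate -- jointly score each (image, query) pair
    reranked = []
    for image_path, _ in stage1:
        image = Image.open(image_path).convert("RGB")
        reranked.append((image_path, cross_encoder_score(image, query)))

    reranked.sort(key=lambda item: item[1], reverse=True)
    return reranked[:top_k]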

Handling query ambiguity: When users search for "apple," the system should ideally return both fruit and technology company images, or use context to disambiguate. Strategies include:

  • Returning diverse results that span multiple interpretations
  • Using follow-up queries to refine intent
  • Incorporating user interaction history for personalized disambiguation


28.10 Advanced Topics in Multimodal AI

28.10.1 Document Understanding

Models like LayoutLM and Donut process documents that contain both text and layout/visual information. They understand tables, forms, and structured documents by encoding both the text content and the spatial positions of text regions.

28.10.2 Visual Grounding

Visual grounding connects text phrases to specific image regions. Given "the red car on the left," the model must identify the bounding box or segmentation mask of the referenced object. This is more challenging than standard object detection because it requires understanding natural language referring expressions.

Referring expression comprehension requires the model to:

  1. Parse the text to identify the target description ("the red car") and spatial constraints ("on the left").
  2. Identify candidate objects in the image.
  3. Select the object that best matches the description.

Modern approaches use models like Grounding DINO (discussed in Section 28.3.2) that jointly process text and image features. The text tokens attend to image features through cross-attention, and a detection head predicts bounding boxes for the referred objects.

Benchmarks include RefCOCO (objects in COCO images), RefCOCO+ (no spatial language allowed, forcing attribute-based grounding), and RefCOCOg (longer, more complex expressions). State-of-the-art models achieve over 90% accuracy on RefCOCO, approaching human performance.

28.10.3 Text-to-Image Retrieval at Scale

Large-scale text-to-image retrieval powers search engines like Google Images and stock photography platforms. Key challenges include:

  • Handling query ambiguity: "Apple" could mean the fruit or the company.
  • Semantic gap: The visual appearance of "freedom" or "innovation" is subjective.
  • Relevance vs. diversity: Top results should be both relevant and diverse.

28.10.4 Multimodal Reasoning

Recent models (GPT-4V, Gemini) demonstrate increasingly sophisticated multimodal reasoning:

  • Chart and graph understanding: Extracting data and trends from visualizations.
  • Spatial reasoning: Understanding "above," "behind," "between" in complex scenes.
  • Temporal reasoning: Understanding sequences of images as a narrative.
  • Mathematical reasoning: Solving geometry problems from diagrams.

28.10.5 SigLIP and Improved Contrastive Objectives

SigLIP (Zhai et al., 2023) replaces the softmax-based contrastive loss with a sigmoid loss that operates on individual image-text pairs rather than requiring all-pairs comparison within a batch:

$$\mathcal{L}_{\text{SigLIP}} = -\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \log \sigma(y_{ij} \cdot (\tau \cdot \cos(\mathbf{v}_i, \mathbf{u}_j) - b))$$

where $y_{ij} = 1$ for matched pairs and $y_{ij} = -1$ for unmatched pairs, $\tau$ is a learned temperature, and $b$ is a learned bias.

The key advantage is that the sigmoid loss decouples each pair's contribution, eliminating the need for the softmax normalization that requires all-pairs comparison within a batch. This enables more efficient training with smaller batch sizes while maintaining or improving performance. SigLIP has been adopted as the vision encoder in several recent multimodal models, including PaLI-3 and IDEFICS2.
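
A minimal sketch of the sigmoid objective, implementing the equation above directly, is:

import torch
import torch.nn.functional as F


def siglip_loss(
    image_emb: torch.Tensor,    # [N, d], L2-normalized image embeddings
    text_emb: torch.Tensor,     # [N, d], L2-normalized text embeddings
    temperature: torch.Tensor,  # learned scalar tau
    bias: torch.Tensor,         # learned scalar b
) -> torch.Tensor:
    """Pairwise sigmoid loss as written above (minimal sketch)."""
    n = image_emb.size(0)
    logits = temperature * image_emb @ text_emb.t() - bias  # [N, N]
    labels = 2 * torch.eye(n, device=logits.device) - 1     # +1 diagonal, -1 elsewhere
    # -log sigmoid(y_ij * logit_ij), averaged over all N^2 pairs
    return -F.logsigmoid(labels * logits).mean()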

28.10.6 The CogVLM Family

CogVLM (Wang et al., 2023) introduces a different approach to connecting vision and language: rather than projecting visual tokens into the LLM's input space, CogVLM adds a "visual expert" module to each transformer layer. Each layer has separate QKV projection matrices for visual tokens and text tokens, enabling deeper integration of visual information without modifying the pre-trained text weights.

This design principle — adding modality-specific parameters within each layer rather than just at the input — enables richer cross-modal interaction at every processing stage. CogVLM achieved state-of-the-art results on multiple benchmarks and inspired subsequent architectures to consider deeper integration strategies.


28.11 Practical Considerations

28.11.1 Choosing a Multimodal Model

The choice depends on your task:

| Task | Recommended Approach |
|---|---|
| Zero-shot classification | CLIP or SigLIP |
| Image captioning | BLIP-2 or LLaVA |
| Visual QA | LLaVA or BLIP-2 |
| Image retrieval | CLIP embeddings + FAISS |
| Open-vocabulary detection | OWL-ViT or Grounding DINO |
| Visual conversation | LLaVA or GPT-4V API |
| Few-shot multimodal tasks | Flamingo-style models |

28.11.2 Fine-Tuning Strategies

  • Full fine-tuning: Unfreeze all parameters. Best performance but highest cost.
  • Adapter tuning: Add small trainable modules to frozen models. Good balance.
  • LoRA: Low-rank adaptation of attention weights. Parameter-efficient; see the sketch after this list.
  • Prompt tuning: Learn soft prompts prepended to inputs. Most parameter-efficient.
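
As an example of the LoRA option above, the sketch below attaches low-rank adapters to the language model's attention projections of a LLaVA-style model. It assumes the Hugging Face `peft` and `transformers` libraries and the llava-hf/llava-1.5-7b-hf checkpoint; the module names to target depend on the specific model you fine-tune.

```python
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)

# Add low-rank adapters only to the language model's attention projections;
# the vision tower and projector stay frozen. The regex below matches module
# names like "language_model...self_attn.q_proj".
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*language_model.*(q_proj|k_proj|v_proj|o_proj)",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```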

28.11.3 Computational Requirements

| Model | Parameters | GPU Memory (Inference) | GPU Memory (Training) |
|---|---|---|---|
| CLIP ViT-B/32 | 151M | ~2 GB | ~8 GB |
| CLIP ViT-L/14 | 427M | ~4 GB | ~16 GB |
| BLIP-2 (OPT-2.7B) | 3.8B | ~10 GB | ~40 GB |
| LLaVA-1.5 (7B) | 7.1B | ~16 GB | ~60 GB |
| LLaVA-1.5 (13B) | 13.4B | ~28 GB | ~100 GB |
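
Inference memory can be reduced substantially by loading quantized weights. Below is a minimal sketch, assuming the `bitsandbytes`, `accelerate`, and `transformers` packages and the llava-hf/llava-1.5-7b-hf checkpoint; actual memory use varies with hardware, context length, and generation settings.

```python
import torch
from transformers import (AutoProcessor, BitsAndBytesConfig,
                          LlavaForConditionalGeneration)

# Store weights in 4-bit NF4 format while computing in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available GPUs/CPU automatically
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
```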

28.11.4 Common Pitfalls and Best Practices

When deploying multimodal models, practitioners commonly encounter the following issues:

  1. Resolution mismatch: CLIP was trained at 224x224 or 336x336, but real-world images may be much higher resolution. Simply resizing to the model's expected resolution can lose important details (small text, fine-grained differences). Solutions include tiling the image into multiple crops (as in LLaVA-NeXT's AnyRes) or using a higher-resolution vision encoder.

  2. Hallucination: Multimodal LLMs frequently hallucinate — they may describe objects that are not in the image, especially common objects that frequently co-occur in training data. For example, if shown a kitchen counter, the model might mention a "toaster" that is not actually present. Evaluation benchmarks like POPE and CHAIR specifically measure hallucination rates.

  3. Prompt sensitivity: Both CLIP zero-shot classification and multimodal LLM responses are sensitive to prompt wording. "Describe this image" can produce noticeably different output from "What do you see in this photo?" Systematic prompt engineering and evaluation across multiple prompt variants is essential; a small sketch comparing prompt variants follows this list.

  4. Tokenization budget: In models like LLaVA, visual tokens compete with text tokens for the LLM's context window. With 576 visual tokens (from the 24x24 CLIP grid), only 1,472 text tokens remain in a 2,048-token context window. For longer conversations or detailed prompts, this budget becomes limiting. Higher-resolution images (more visual tokens) exacerbate the problem.

  5. Cross-lingual limitations: Most multimodal models are trained primarily on English image-text pairs. Performance on non-English queries can be significantly worse, especially for culturally specific visual concepts.
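
As an illustration of pitfall 3, the sketch below scores the same image against a few class names under different prompt templates, assuming the Hugging Face CLIP implementation with the openai/clip-vit-base-patch32 checkpoint; `image` is a hypothetical PIL image you supply. The ranking and confidences can shift with the template alone.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "truck"]
templates = [
    "a photo of a {}",
    "a blurry photo of a {}",
    "an origami {}",
]

with torch.no_grad():
    for template in templates:
        prompts = [template.format(c) for c in classes]
        inputs = processor(text=prompts, images=image,
                           return_tensors="pt", padding=True)
        # Softmax over the per-class similarity logits for this one image.
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
        print(template, {c: round(p.item(), 3) for c, p in zip(classes, probs)})
```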

28.12 Summary

Multimodal models represent one of the most significant advances in AI, bridging the gap between visual perception and language understanding. The key ideas covered in this chapter are:

  1. CLIP established that contrastive learning on large-scale image-text pairs produces powerful, aligned representations that enable zero-shot transfer to a wide range of vision tasks.

  2. Zero-shot classification with CLIP eliminates the need for task-specific training data, using natural language prompts as classifiers.

  3. Image captioning has evolved from encoder-decoder architectures to systems like BLIP-2 that bridge frozen vision encoders and frozen LLMs with lightweight trainable modules.

  4. Visual Question Answering tests deeper visual understanding and has been transformed by generative approaches that leverage LLM capabilities.

  5. LLaVA demonstrated that a simple projection layer connecting a CLIP vision encoder to an instruction-tuned LLM creates a powerful visual assistant, with visual instruction tuning as the key training innovation.

  6. Flamingo showed that gated cross-attention enables few-shot multimodal learning, where models learn new tasks from interleaved image-text examples in context.

  7. Multimodal embeddings from CLIP and ImageBind create shared spaces that enable cross-modal retrieval, composition, and reasoning.

  8. Retrieval systems built on CLIP embeddings and vector search (FAISS) provide practical infrastructure for searching and organizing visual content at scale.

The field continues to advance rapidly, with models like GPT-4V and Gemini demonstrating increasingly sophisticated multimodal reasoning. The techniques in this chapter provide the foundation for understanding and building these systems.

Emerging Architecture Patterns

Looking across the models discussed in this chapter, we see three dominant patterns for connecting vision and language:

  1. Dual-encoder (CLIP-style): Separate encoders produce aligned embeddings. Best for retrieval and zero-shot classification. Fast at inference (encode once, search many times) but limited in cross-modal reasoning depth.

  2. Bridge module (BLIP-2, Flamingo): A lightweight trainable module connects frozen vision and language models. Parameter-efficient and preserves the capabilities of both foundation models. Best for captioning, VQA, and few-shot tasks.

  3. Direct integration (LLaVA, CogVLM): Visual tokens are injected directly into the LLM's token sequence. Maximum flexibility for open-ended conversation and reasoning. Higher training cost but the most versatile.

Understanding these patterns will help you choose the right approach for your application and anticipate the design choices of future multimodal models. As we move to Chapter 29 (Speech, Audio, and Music AI) and Chapter 30 (Video Understanding), we will see these same patterns adapted to additional modalities, demonstrating their generality across the full spectrum of perceptual AI.


28.13 Exercises

  1. CLIP zero-shot evaluation: Using the HuggingFace CLIP implementation, evaluate zero-shot classification accuracy on CIFAR-10 using (a) the simple template "a photo of a {class}", (b) an ensemble of 10 different templates, and (c) class-specific prompts (e.g., "a photo of a truck, a type of vehicle"). Measure the accuracy improvement from prompt engineering.

  2. Embedding space exploration: Encode 1,000 images from 10 different categories using CLIP ViT-B/32. Visualize the embeddings using t-SNE or UMAP. Do images from the same category cluster together? How do text embeddings of the category names relate to the image clusters?

  3. Image-text retrieval: Build a simple retrieval system using CLIP embeddings and FAISS. Index 10,000 images from COCO, then evaluate recall@1, recall@5, and recall@10 for text-to-image retrieval using the COCO validation captions.

  4. BLIP-2 captioning: Generate captions for 100 diverse images using BLIP-2 (OPT-2.7B variant). Compute CLIP-Score between each generated caption and its image. Identify failure cases where the model hallucinates objects not present in the image.

  5. LLaVA conversation: Using a pre-trained LLaVA model, conduct multi-turn conversations about complex images (e.g., infographics, memes, diagrams). Document cases where the model succeeds at visual reasoning and cases where it fails. What types of visual understanding remain challenging?

References

  • Radford, A., Kim, J. W., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
  • Li, J., Li, D., et al. (2022). BLIP: Bootstrapping Language-Image Pre-training. ICML 2022.
  • Li, J., Li, D., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023.
  • Liu, H., Li, C., et al. (2023). Visual Instruction Tuning. NeurIPS 2023.
  • Alayrac, J.-B., Donahue, J., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022.
  • Girdhar, R., El-Nouby, A., et al. (2023). ImageBind: One Embedding Space To Bind Them All. CVPR 2023.
  • Minderer, M., Gritsenko, A., et al. (2022). Simple Open-Vocabulary Object Detection with Vision Transformers. ECCV 2022.
  • Zhai, X., Mustafa, B., et al. (2023). Sigmoid Loss for Language Image Pre-Training. ICCV 2023.
  • Liu, S., Zeng, Z., et al. (2023). Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. ECCV 2024.