Chapter 28: Key Takeaways
CLIP and Contrastive Learning
- CLIP learns a shared embedding space for images and text through contrastive learning on 400M image-text pairs. Matched pairs are pulled together; unmatched pairs are pushed apart. The result is a general-purpose visual representation that can be queried with natural language descriptions.
- Zero-shot classification works by encoding class names as text prompts, encoding the image, and selecting the class with highest cosine similarity. No labeled training data is needed for the target task.
- Prompt engineering is critical for zero-shot performance: using templates like "a photo of a {class}" improves accuracy by 5-8% over bare class names. Prompt ensembling (averaging embeddings from 80 diverse templates) provides an additional 2-3%. A sketch combining both techniques follows this list.
- CLIP's limitations include weak compositional understanding (counting, spatial relationships, attribute binding), bias from web-scraped training data, and difficulty with fine-grained recognition.
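A minimal sketch of zero-shot classification with prompt ensembling, assuming the Hugging Face transformers CLIP wrapper; the checkpoint, class names, templates, and image path below are illustrative:

```python
# Zero-shot classification with prompt ensembling (sketch).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "car"]  # illustrative label set
templates = ["a photo of a {}.", "a close-up photo of a {}.", "a blurry photo of a {}."]

with torch.no_grad():
    # Encode every (template, class) prompt, then average per class.
    class_embeds = []
    for name in classes:
        prompts = [t.format(name) for t in templates]
        tok = processor(text=prompts, return_tensors="pt", padding=True)
        emb = model.get_text_features(**tok)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean = emb.mean(dim=0)
        class_embeds.append(mean / mean.norm())
    text_matrix = torch.stack(class_embeds)          # (num_classes, dim)

    image = Image.open("example.jpg")                # illustrative path
    pix = processor(images=image, return_tensors="pt")
    img_emb = model.get_image_features(**pix)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Cosine similarity against the ensembled class embeddings; pick the best class.
    scores = img_emb @ text_matrix.T
    print(classes[scores.argmax().item()])
```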
Image Captioning
- Encoder-decoder architectures use a vision encoder (CNN or ViT) to extract image features and a language decoder with cross-attention to generate text autoregressively.
- BLIP-2 bridges frozen vision encoders and frozen LLMs through the Q-Former, a lightweight transformer with 32 learnable queries that compress visual information. Only the Q-Former (188M parameters) is trained. A captioning sketch follows this list.
- Evaluation metrics include BLEU (n-gram precision), CIDEr (TF-IDF weighted similarity, designed for captioning), SPICE (semantic scene graph comparison), and CLIPScore (reference-free, CLIP-based alignment).
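A captioning sketch using a pretrained BLIP-2 checkpoint through its Hugging Face transformers port; the checkpoint name, image path, and generation settings are illustrative:

```python
# Image captioning with BLIP-2 (sketch).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # illustrative path
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)

# The frozen ViT and frozen LLM are bridged by the trained Q-Former;
# generation itself is ordinary autoregressive decoding in the LLM.
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(out, skip_special_tokens=True)[0].strip()
print(caption)
```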
Visual Question Answering
- VQA requires answering natural language questions about images, testing deeper understanding than captioning: counting, spatial reasoning, attribute recognition, and world knowledge.
- Modern VQA treats the task as text generation conditioned on visual tokens, leveraging the generative capabilities of LLMs rather than classifying over a fixed answer vocabulary (see the sketch after this list).
- Counting and spatial reasoning remain the weakest capabilities even for state-of-the-art systems.
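A VQA-as-generation sketch, assuming the llava-hf/llava-1.5-7b-hf port in transformers; the prompt template, question, and image path are illustrative:

```python
# VQA as text generation conditioned on visual tokens (sketch).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

image = Image.open("kitchen.jpg")  # illustrative path
prompt = "USER: <image>\nHow many mugs are on the counter? ASSISTANT:"

# The image becomes visual tokens that are prepended to the text tokens;
# the LLM then produces the answer as ordinary next-token prediction.
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```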
Multimodal Large Language Models
- LLaVA connects CLIP ViT-L/14 to a Vicuna LLM through a simple MLP projection. Visual tokens are prepended to text tokens and processed by the LLM's self-attention. The two-stage training (feature alignment, then visual instruction tuning) is key to its effectiveness.
- Flamingo uses gated cross-attention injected between frozen LLM layers, where text tokens attend to visual tokens from a Perceiver Resampler. Zero initialization of the gates preserves the LLM's original capabilities (see the sketch after this list).
- Flamingo excels at few-shot multimodal learning — learning new tasks from just a few interleaved image-text examples in the context window.
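A conceptual sketch of a Flamingo-style gated cross-attention block, not the original implementation, showing how zero-initialized tanh gates make the block an identity function at the start of training:

```python
# Flamingo-style gated cross-attention (conceptual sketch).
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ff = nn.LayerNorm(dim)
        # Gates start at zero: tanh(0) = 0, so at initialization the block
        # passes text states through unchanged and the frozen LLM is preserved.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text: (batch, text_len, dim); visual: (batch, num_visual_tokens, dim)
        attn_out, _ = self.attn(self.norm_attn(text), visual, visual)
        text = text + torch.tanh(self.attn_gate) * attn_out
        text = text + torch.tanh(self.ff_gate) * self.ff(self.norm_ff(text))
        return text

block = GatedCrossAttentionBlock(dim=512)
text = torch.randn(2, 16, 512)      # e.g. text hidden states from the LLM
visual = torch.randn(2, 64, 512)    # e.g. tokens from a Perceiver Resampler
print(block(text, visual).shape)    # torch.Size([2, 16, 512])
```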
Multimodal Embeddings and Retrieval
- ImageBind extends CLIP's approach to six modalities (images, text, audio, depth, thermal, IMU), using images as the "binding" modality. Alignment between non-image pairs (e.g., audio-text) emerges transitively.
- Embedding spaces support cross-modal retrieval, embedding arithmetic (adding the text embedding of "sunset" to an image embedding retrieves sunset versions of similar scenes), and zero-shot transfer across modalities.
- FAISS provides efficient similarity search for production retrieval systems. For millions of images, use approximate indices (IVF-PQ) with quantized embeddings for practical latency and memory.
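A retrieval sketch with an IVF-PQ index over CLIP image embeddings; the corpus size and index parameters (nlist, m, nprobe) are illustrative and should be tuned to the dataset:

```python
# Approximate nearest-neighbor retrieval with FAISS IVF-PQ (sketch).
import faiss
import numpy as np

dim = 512                                   # CLIP ViT-B/32 embedding size
embeddings = np.random.rand(100_000, dim).astype("float32")  # placeholder corpus
faiss.normalize_L2(embeddings)              # with unit vectors, L2 ranks like cosine

nlist, m, nbits = 1024, 64, 8               # coarse clusters, PQ subvectors, bits per code
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)

index.train(embeddings)                     # learn coarse and PQ codebooks
index.add(embeddings)
index.nprobe = 16                           # clusters searched per query: recall vs. speed knob

query = embeddings[:1]                      # stand-in for an encoded text or image query
distances, ids = index.search(query, k=5)   # lower distance = better match
print(ids[0], distances[0])
```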
Practical Guidelines
- For zero-shot classification, use CLIP with prompt ensembling. Upgrade to ViT-L/14 for accuracy-critical applications.
- For image captioning, BLIP-2 with a frozen LLM provides strong results with minimal fine-tuning.
- For VQA and visual conversation, LLaVA-1.5 offers the best open-source option, with the 13B variant providing meaningfully better reasoning than 7B.
- For retrieval, CLIP embeddings in FAISS provide a fast, scalable baseline. Consider re-ranking with a cross-encoder for precision-critical applications.
- Fine-tuning strategy depends on data availability: zero-shot (no data) -> linear probe (few hundred examples) -> LoRA (thousands) -> full fine-tuning (tens of thousands).
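A sketch of the linear-probe rung of that ladder: a logistic-regression classifier fit on frozen CLIP image features. The feature and label arrays here are placeholders; in practice, extract features with the encoder shown in the zero-shot sketch and tune the regularization strength C on held-out data.

```python
# Linear probe on frozen CLIP image features (sketch with placeholder data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 512)).astype("float32")  # placeholder CLIP embeddings
labels = rng.integers(0, 3, size=300)                     # placeholder class labels

# The vision encoder stays frozen; only this linear classifier is fit.
probe = LogisticRegression(max_iter=1000, C=1.0)          # tune C via validation
probe.fit(features, labels)
print(probe.score(features, labels))
```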