Chapter 28: Further Reading

Foundational Papers

Contrastive Vision-Language Learning

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021. The CLIP paper. Demonstrates that contrastive pre-training on 400M image-text pairs produces visual representations with remarkable zero-shot transfer capabilities. Essential reading for understanding the foundation of modern vision-language AI; a minimal zero-shot usage sketch follows this list.

  • Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., ... & Duerig, T. (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision." ICML 2021. Google's ALIGN paper, trained on 1.8B noisy image-text pairs. Demonstrates that scaling data (even noisy data) is more important than curating clean datasets, complementing CLIP's findings.

  • Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). "Sigmoid Loss for Language Image Pre-Training." ICCV 2023. SigLIP replaces CLIP's softmax contrastive loss with a sigmoid loss, enabling more efficient training with smaller batch sizes while matching or exceeding CLIP's performance.
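
To make the contrastive setup concrete, the sketch below runs zero-shot classification with a pretrained CLIP checkpoint through the Hugging Face Transformers integration listed under Online Resources. The image path and candidate labels are placeholders, and the snippet is a minimal illustration rather than a tuned pipeline.

    # Zero-shot classification with a pretrained CLIP model via Hugging Face
    # Transformers. The image path and label prompts are placeholders.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")
    labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds temperature-scaled cosine similarities between the
    # image embedding and each text embedding; softmax turns them into a
    # distribution over the candidate labels.
    probs = outputs.logits_per_image.softmax(dim=-1)
    for label, prob in zip(labels, probs[0].tolist()):
        print(f"{label}: {prob:.3f}")

In these terms, SigLIP's change is to the training loss rather than the interface: each image-text pair is scored independently with a sigmoid instead of being normalized over the batch with a softmax, which is what relaxes the large-batch requirement.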

Multimodal Language Models

  • Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning." NeurIPS 2023. The LLaVA paper. Shows that a simple projection layer connecting CLIP to an LLM, combined with visual instruction tuning data generated by GPT-4, creates a powerful visual assistant. Influential for its simplicity and effectiveness; a schematic sketch of the projection follows this list.

  • Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., ... & Simonyan, K. (2022). "Flamingo: a Visual Language Model for Few-Shot Learning." NeurIPS 2022. Introduces Flamingo's gated cross-attention architecture for injecting visual information into frozen LLMs. Demonstrates remarkable few-shot multimodal learning capabilities.

  • Li, J., Li, D., Savarese, S., & Hoi, S. (2023). "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." ICML 2023. Introduces the Q-Former architecture that efficiently bridges frozen vision encoders and frozen LLMs. Achieves strong performance with minimal trainable parameters.
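
The core architectural move in LLaVA is small enough to sketch directly: a trainable projection maps frozen vision-encoder patch features into the LLM's token-embedding space, and the projected "visual tokens" are prepended to the text embeddings. The module and dimensions below are illustrative assumptions, not the actual LLaVA code.

    # Illustrative LLaVA-style vision-to-language bridge. Dimensions are
    # hypothetical (1024 ~ CLIP ViT-L patch features, 4096 ~ a 7B LLM hidden size).
    import torch
    import torch.nn as nn

    class VisionToLLMProjector(nn.Module):
        def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            # The original LLaVA used a single linear layer; LLaVA-1.5 uses a
            # small MLP like this one.
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
            # patch_features: (batch, num_patches, vision_dim) from a frozen ViT.
            # The output is concatenated with text token embeddings before the
            # LLM forward pass and trained on visual instruction data.
            return self.proj(patch_features)

    projector = VisionToLLMProjector()
    visual_tokens = projector(torch.randn(1, 576, 1024))  # 24x24 = 576 patches
    print(visual_tokens.shape)  # torch.Size([1, 576, 4096])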

Image Captioning and VQA

  • Li, J., Li, D., Xiong, C., & Hoi, S. (2022). "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation." ICML 2022. Introduces a unified pre-training framework with contrastive, matching, and generative objectives. Also introduces CapFilt, a technique for bootstrapping high-quality training captions; a short captioning and VQA usage sketch follows this list.

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). "VQA: Visual Question Answering." ICCV 2015. The foundational VQA paper that defined the task, introduced the VQA dataset, and established evaluation protocols. Essential for understanding the history and challenges of VQA.

  • Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering." CVPR 2017. Introduces VQAv2, which balances the dataset to reduce language bias (ensuring each question has different answers for different images).
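
Both captioning and VQA can be exercised through the BLIP checkpoints available in Hugging Face Transformers. The sketch below uses the base captioning and VQA models with a placeholder image path and question; treat it as a minimal usage example rather than an evaluation setup.

    # Image captioning and visual question answering with BLIP via Hugging Face
    # Transformers. "example.jpg" and the question are placeholders.
    from PIL import Image
    from transformers import (BlipForConditionalGeneration,
                              BlipForQuestionAnswering, BlipProcessor)

    image = Image.open("example.jpg")

    # Captioning
    cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    cap_inputs = cap_processor(images=image, return_tensors="pt")
    caption_ids = cap_model.generate(**cap_inputs, max_new_tokens=30)
    print(cap_processor.decode(caption_ids[0], skip_special_tokens=True))

    # Visual question answering
    vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
    vqa_inputs = vqa_processor(images=image, text="how many people are in the photo?",
                               return_tensors="pt")
    answer_ids = vqa_model.generate(**vqa_inputs)
    print(vqa_processor.decode(answer_ids[0], skip_special_tokens=True))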

Multimodal Embeddings

  • Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., & Misra, I. (2023). "ImageBind: One Embedding Space To Bind Them All." CVPR 2023. Extends contrastive learning to six modalities (images, text, audio, depth, thermal, and IMU data) by using images as the binding modality. Demonstrates emergent cross-modal alignment between modalities never directly paired during training.
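
The practical payoff of a single shared space is that cross-modal retrieval reduces to cosine similarity, even for modality pairs (say, text and audio) that were never directly aligned during training. The toy sketch below uses random placeholder vectors in place of real encoder outputs to show only the retrieval step.

    # Toy cross-modal retrieval in a shared embedding space (the setting
    # ImageBind targets). Random vectors stand in for real encoder outputs.
    import torch
    import torch.nn.functional as F

    dim = 512
    text_query = F.normalize(torch.randn(1, dim), dim=-1)   # e.g. "dog barking"
    audio_bank = F.normalize(torch.randn(5, dim), dim=-1)   # 5 candidate clips

    # Cosine similarity between the text query and every audio embedding;
    # the highest-scoring clip is the retrieval result.
    scores = text_query @ audio_bank.T        # shape (1, 5)
    print(scores, scores.argmax(dim=-1))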

Open-Vocabulary Detection and Grounding

  • Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., ... & Houlsby, N. (2022). "Simple Open-Vocabulary Object Detection with Vision Transformers." ECCV 2022. OWL-ViT adapts CLIP for open-vocabulary object detection, enabling detection of objects described by arbitrary text queries; a short usage sketch follows this list.

  • Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., ... & Zhang, L. (2024). "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." ECCV 2024. Combines the DINO object detector with grounded pre-training for state-of-the-art open-set detection; combined with the Segment Anything Model (SAM), it enables text-guided segmentation of arbitrary objects.
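
Open-vocabulary detection is straightforward to try with the OWL-ViT checkpoints in Hugging Face Transformers. In the sketch below, the image path, text queries, and score threshold are assumptions chosen for illustration.

    # Open-vocabulary detection with OWL-ViT via Hugging Face Transformers.
    import torch
    from PIL import Image
    from transformers import OwlViTForObjectDetection, OwlViTProcessor

    processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

    image = Image.open("street.jpg")                      # placeholder image
    queries = [["a red bicycle", "a traffic light", "a person wearing a hat"]]

    inputs = processor(text=queries, images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert raw logits and boxes into thresholded detections in image coordinates.
    target_sizes = torch.tensor([image.size[::-1]])       # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=0.1, target_sizes=target_sizes)[0]
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        print(f"{queries[0][int(label)]}: {score.item():.2f} "
              f"{[round(v, 1) for v in box.tolist()]}")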

Benchmarks and Evaluation

  • Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., & Ross, C. (2022). "Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality." CVPR 2022. A challenging benchmark that reveals CLIP's compositional understanding failures. Important for understanding the limitations of current vision-language models.

  • Yuksekgonul, B., Bianchi, F., Kalluri, P., Jurafsky, D., & Zou, J. (2023). "When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do About It?" ICLR 2023. Analyzes why CLIP and similar models often behave like "bags of words," failing to capture word order and compositional structure.
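
A quick way to see the behavior these papers probe is to compare CLIP text embeddings for two captions that differ only in word order; a strongly compositional encoder should keep them clearly separated. A minimal sketch, assuming the standard Hugging Face CLIP checkpoint:

    # Probe word-order sensitivity of CLIP's text encoder, in the spirit of
    # Winoground and the bags-of-words analysis. A cosine similarity near 1.0
    # for role-swapped captions indicates limited compositional sensitivity.
    import torch
    import torch.nn.functional as F
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    captions = ["a dog chasing a cat", "a cat chasing a dog"]
    inputs = processor(text=captions, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = F.normalize(model.get_text_features(**inputs), dim=-1)

    print(f"cosine similarity: {(text_emb[0] @ text_emb[1]).item():.3f}")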

Surveys and Tutorials

  • Du, Y., Liu, Z., Li, J., & Zhao, W. X. (2022). "A Survey of Vision-Language Pre-Trained Models." IJCAI 2022. A comprehensive survey of vision-language pre-training methods, covering architectures, objectives, datasets, and downstream tasks.

Online Resources

  • OpenAI CLIP Repository: https://github.com/openai/CLIP — Official implementation with model weights and usage examples.
  • Hugging Face Transformers: https://huggingface.co/docs/transformers/model_doc/clip — CLIP integration with the Transformers library, including all model variants.
  • LLaVA Project: https://llava-vl.github.io/ — Official LLaVA page with model weights, training data, and demo.
  • LAION: https://laion.ai/ — The organization behind LAION-5B, the open dataset used to train open-source CLIP alternatives.

Looking Ahead

The multimodal foundations from this chapter connect directly to several other chapters:

  • Chapter 27 (Diffusion Models): CLIP provides the text encoder for Stable Diffusion, and CLIP Score is a key evaluation metric for text-to-image generation; a short CLIP Score sketch follows this list.
  • Chapter 30 (Video Understanding): Video-language models extend the image-language architectures covered here to the temporal domain.
  • Chapter 31 (RAG): Multimodal retrieval systems built on CLIP embeddings can be integrated into retrieval-augmented generation pipelines.
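
As a pointer toward the diffusion-model chapter, the sketch below shows a CLIP-Score-style metric: the cosine similarity between a prompt embedding and a generated-image embedding, conventionally clipped at zero and scaled by 100. This is a minimal illustration, not the exact formulation of any particular benchmark library, and the image path is a placeholder.

    # CLIP-Score-style prompt/image agreement metric (minimal sketch).
    import torch
    import torch.nn.functional as F
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(image: Image.Image, prompt: str) -> float:
        inputs = processor(text=[prompt], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            img = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
            txt = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                                      attention_mask=inputs["attention_mask"]), dim=-1)
        # Cosine similarity, clipped at zero and scaled by 100 by convention.
        return 100.0 * torch.clamp(img @ txt.T, min=0.0).item()

    print(clip_score(Image.open("generated.png"), "a watercolor painting of a fox"))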