Case Study 1: Building a VQA System

Overview

Visual Question Answering (VQA) sits at the intersection of computer vision and natural language processing, requiring models to understand both the visual content of an image and the semantic intent of a question. In this case study, you will build a complete VQA system using BLIP-2 from Hugging Face, evaluate its capabilities across different question types, and analyze its failure modes to understand the current limitations of vision-language models.

Problem Statement

Build and evaluate a VQA system that can answer diverse questions about images, including:

  • Object identification ("What is the person holding?")
  • Counting ("How many dogs are in the image?")
  • Color and attribute recognition ("What color is the car?")
  • Spatial reasoning ("What is to the left of the table?")
  • Activity recognition ("What is the person doing?")
  • Scene understanding ("Where was this photo taken?")

Dataset

We evaluate on a curated subset of the VQAv2 validation set containing 500 image-question-answer triplets, stratified across question types: Yes/No (150), Number (100), and Other (250). Additionally, we collect 50 custom examples with known ground truth to test specific capabilities.

Approach

Step 1: Model Selection and Setup

We use BLIP-2 with OPT-2.7B as the language model backbone. This provides a strong balance between capability and computational requirements:

  • BLIP-2 uses a frozen EVA-CLIP ViT-G/14 image encoder (1B parameters)
  • The Q-Former (188M parameters) bridges vision and language
  • OPT-2.7B provides generative language capabilities
  • Total inference cost: ~10 GB GPU memory
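
Below is a minimal loading sketch using Hugging Face Transformers. The Salesforce/blip2-opt-2.7b checkpoint name and the fp16/device settings are assumptions; the reference implementation in code/case-study-code.py may differ.

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # assumption: fp16 on GPU

# The processor handles both image preprocessing and question tokenization.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)
model.eval()
```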

Step 2: Inference Pipeline

The VQA pipeline (see the sketch after this list):

  1. Load and preprocess the image (resize to 364x364, normalize)
  2. Tokenize the question with the BLIP-2 processor
  3. Generate the answer using the model with beam search (num_beams=5, max_length=10)
  4. Post-process the output (strip whitespace, lowercase for evaluation)
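
The sketch below reuses the processor and model from Step 1. The prompt template and the use of max_new_tokens to cap only the generated answer (rather than an overall max_length) are assumptions based on the decoding settings described here and in Key Lessons.

```python
from PIL import Image

def answer_question(image_path: str, question: str) -> str:
    """Run the four pipeline steps for a single image-question pair."""
    # 1. Load the image; the processor resizes and normalizes it.
    image = Image.open(image_path).convert("RGB")

    # 2. Build the prompt and tokenize it together with the image.
    prompt = f"Question: {question} Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)

    # 3. Generate the answer with beam search.
    output_ids = model.generate(**inputs, num_beams=5, max_new_tokens=10)

    # 4. Decode and normalize for evaluation.
    answer = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    return answer.strip().lower()
```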

Step 3: Evaluation Across Question Types

We systematically evaluate accuracy across question categories:

Yes/No questions: Evaluate exact match accuracy after normalization.

Number questions: Evaluate both exact match and tolerance-based accuracy (within +/- 1).

Open-ended questions: Use the standard VQA accuracy metric: an answer receives full credit if it matches at least 3 of the 10 human annotators' answers, and partial credit (matches / 3) otherwise.
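
The sketch below shows one way to implement these three scoring rules. Answer normalization is simplified here; the official VQA evaluation script also normalizes articles, punctuation, and number words.

```python
def yes_no_accuracy(pred: str, gold: str) -> float:
    # Exact match after lowercasing and whitespace normalization.
    return float(pred.strip().lower() == gold.strip().lower())

def number_accuracy(pred: str, gold: str, tolerance: int = 0) -> float:
    # tolerance=0 gives exact match; tolerance=1 gives the +/- 1 variant.
    try:
        return float(abs(int(pred) - int(gold)) <= tolerance)
    except ValueError:
        return 0.0  # non-numeric prediction counts as incorrect

def vqa_accuracy(pred: str, human_answers: list[str]) -> float:
    # VQAv2 rule: full credit if at least 3 of the 10 annotators gave the
    # predicted answer, partial credit (matches / 3) otherwise.
    pred = pred.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)
```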

Step 4: Failure Analysis

We categorize failure cases into:

  • Counting errors: Model says "2" when the answer is "3"
  • Hallucination: Model describes objects not present in the image
  • Spatial errors: Incorrect left/right/above/below relationships
  • Knowledge gaps: Questions requiring world knowledge the model lacks
  • Ambiguity: Valid answers marked incorrect due to evaluation limitations

Step 5: Comparison with Zero-Shot CLIP

For Yes/No questions, we compare BLIP-2 with a CLIP-based approach (sketched after this list):

  • Reformulate the question as two prompts (affirmative and negative)
  • Compare CLIP image-text similarity scores to determine the answer
  • Example: "Is there a dog?" -> compare "a photo with a dog" vs. "a photo without a dog"
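
A minimal sketch of this baseline is shown below. The openai/clip-vit-large-patch14 checkpoint is an assumption; any contrastively trained CLIP variant works the same way.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_yes_no(image_path: str, affirmative: str, negative: str) -> str:
    # e.g. affirmative="a photo with a dog", negative="a photo without a dog"
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(
        text=[affirmative, negative], images=image, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        # logits_per_image holds the image-text similarity for both prompts.
        logits = clip_model(**inputs).logits_per_image[0]
    return "yes" if logits[0] > logits[1] else "no"
```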

Results

Accuracy by Question Type

Question Type     BLIP-2 Accuracy     CLIP Zero-Shot     Random Baseline
Yes/No            82.7%               65.3%              50.0%
Number            41.0%               N/A                12.5%
Other             58.4%               N/A                1.2%
Overall           62.8%               N/A                18.6%

Detailed Failure Analysis

Counting (41% accuracy):

  • Accurate for 0-2 objects (78% accuracy)
  • Drops significantly for 3+ objects (21% accuracy)
  • Most common error: underestimating the count by 1

Spatial reasoning (47% accuracy):

  • Left/right distinctions are particularly difficult
  • "In front of" / "behind" is handled slightly better (helped by perspective cues)

Color recognition (79% accuracy):

  • High accuracy for dominant colors
  • Struggles with small objects or mixed colors

Activity recognition (71% accuracy):

  • Good for common activities (walking, sitting, eating)
  • Poor for ambiguous actions (is the person thinking or resting?)

Key Lessons

  1. Generative VQA outperforms classification VQA: BLIP-2's generative approach naturally handles open-ended answers without being limited to a fixed vocabulary.

  2. Counting remains the weakest capability: Even state-of-the-art vision-language models struggle with precise counting, suggesting that current architectures lack the spatial precision needed for enumeration.

  3. Prompt formatting matters: Phrasing the input as "Question: {q} Answer:" produces better results than passing the question alone, because it matches the model's training format (see the sketch after this list).

  4. Beam search vs. greedy decoding: Beam search (num_beams=5) improves accuracy by 2-3% over greedy decoding, especially for longer answers.

  5. Model confidence correlates with accuracy: When BLIP-2's top beam has high probability (>0.8), accuracy is 85%+. When confidence is low (<0.3), accuracy drops to 35%.

  6. CLIP is surprisingly effective for Yes/No: With careful prompt engineering, CLIP achieves 65% on Yes/No questions without any VQA-specific training, demonstrating the power of aligned multimodal representations.
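
A sketch combining lessons 3-5 is shown below; it reuses the processor and model from the setup step. Treating the exponentiated, length-normalized score of the top beam as a confidence proxy is an assumption, and the study's exact confidence definition may differ.

```python
import torch
from PIL import Image

def answer_with_confidence(image_path: str, question: str):
    image = Image.open(image_path).convert("RGB")
    # Lesson 3: wrap the question in the format seen during training.
    prompt = f"Question: {question} Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)

    # Lesson 4: beam search with 5 beams; scores are returned alongside the sequences.
    out = model.generate(
        **inputs,
        num_beams=5,
        max_new_tokens=10,
        return_dict_in_generate=True,
        output_scores=True,
    )
    answer = processor.batch_decode(out.sequences, skip_special_tokens=True)[0]

    # Lesson 5: sequences_scores is the length-normalized log-likelihood of the
    # top beam; exponentiating it gives a rough confidence proxy in [0, 1].
    confidence = torch.exp(out.sequences_scores[0]).item()
    return answer.strip().lower(), confidence
```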

Ethical Considerations

  • VQA models can produce confident but incorrect answers, which is dangerous in high-stakes applications (medical, legal)
  • Models may reflect biases in their training data (e.g., gender stereotypes in activity recognition)
  • The "accuracy" metric may not capture the nuance needed for real-world deployment

Code Reference

The complete implementation is available in code/case-study-code.py.