Case Study 2: Zero-Shot Classification with CLIP

Overview

Traditional image classifiers require labeled training data for every class they must recognize. CLIP's aligned vision-language embedding space enables zero-shot classification — predicting classes the model has never been explicitly trained on — by comparing image embeddings to text embeddings of class descriptions. In this case study, you will build a zero-shot classification pipeline, systematically optimize it through prompt engineering and ensemble techniques, and evaluate it against supervised baselines across multiple datasets.

Problem Statement

Build a zero-shot image classification system using CLIP that:

  1. Classifies images into arbitrary categories defined only by text descriptions
  2. Achieves accuracy competitive with supervised models through prompt engineering
  3. Generalizes across datasets without any fine-tuning or labeled training data

Datasets

We evaluate on three datasets with increasing difficulty:

  1. CIFAR-10: 10 broad classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). 10,000 test images at 32x32 resolution.
  2. CIFAR-100: 100 fine-grained classes organized into 20 superclasses. 10,000 test images.
  3. Oxford Flowers-102: 102 flower species. 6,149 test images. Tests fine-grained recognition.

Approach

Step 1: Baseline Zero-Shot Classification

The simplest approach uses class names directly:

  1. Create one text prompt per class, consisting of just the bare class name
  2. Encode all prompts with CLIP's text encoder
  3. Encode each test image with CLIP's image encoder
  4. Predict the class whose text embedding has the highest cosine similarity with the image embedding (see the sketch below)
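
As a concrete reference, here is a minimal sketch of this baseline, assuming the openai/CLIP package and PyTorch (the CIFAR-10 class list and device handling are illustrative; the full pipeline is in code/case-study-code.py):

```python
# Minimal baseline sketch, assuming the openai/CLIP package (github.com/openai/CLIP)
# and PyTorch; dataset loading is omitted.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]  # CIFAR-10

# Steps 1-2: one bare-class-name prompt per class, encoded once.
text_tokens = clip.tokenize(class_names).to(device)
with torch.no_grad():
    text_emb = model.encode_text(text_tokens)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def predict(pil_image):
    """Steps 3-4: encode the image, return the most similar class."""
    image = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    similarity = img_emb @ text_emb.T  # cosine similarity of unit vectors
    return class_names[similarity.argmax(dim=-1).item()]
```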

Step 2: Single-Template Prompt Engineering

We test various prompt templates and measure their impact:

  • "{class}" (bare class name)
  • "a photo of a {class}"
  • "a photo of a {class}, a type of object"
  • "a centered photo of a {class}"
  • "a good photo of a {class}"

For domain-specific datasets (Flowers-102), we also test domain templates:

  • "a photo of a {class}, a type of flower"
  • "a close-up photo of a {class} flower"
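
The sweep itself is a small loop over templates; the sketch below assumes the objects defined in the baseline sketch plus a `test_set` list of (PIL image, label index) pairs:

```python
# Sketch of the template sweep. Reuses `model`, `preprocess`, `device`, and
# `class_names` from the baseline sketch; `test_set` is assumed to be a list of
# (PIL image, label index) pairs.
templates = [
    "{}",                                  # bare class name
    "a photo of a {}",
    "a photo of a {}, a type of object",
    "a centered photo of a {}",
    "a good photo of a {}",
]

def encode_classes(template):
    tokens = clip.tokenize([template.format(c) for c in class_names]).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    return emb / emb.norm(dim=-1, keepdim=True)

for template in templates:
    text_emb = encode_classes(template)
    correct = 0
    for pil_image, label in test_set:
        image = preprocess(pil_image).unsqueeze(0).to(device)
        with torch.no_grad():
            img_emb = model.encode_image(image)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        correct += int((img_emb @ text_emb.T).argmax().item() == label)
    print(f"{template!r}: {correct / len(test_set):.1%}")
```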

Step 3: Prompt Ensemble

Following CLIP's original paper, we ensemble multiple prompt templates:

  1. Define $K$ diverse templates for each class
  2. Encode each template and average the embeddings per class: $\bar{\mathbf{u}}_c = \frac{1}{K}\sum_{k=1}^{K} f_{\text{text}}(\text{template}_k(c))$
  3. Normalize the averaged embedding: $\hat{\mathbf{u}}_c = \bar{\mathbf{u}}_c / \|\bar{\mathbf{u}}_c\|$
  4. Use the ensembled embeddings for classification

We use 80 templates from OpenAI's CLIP prompt collection, covering diverse contexts ("a painting of a {class}", "a cartoon {class}", "a blurry photo of a {class}", etc.).
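
Concretely, the ensembling step can be sketched as follows, with `templates` standing in for the 80-template collection and `model`, `device`, `class_names` reused from the earlier sketches:

```python
# Sketch of prompt ensembling, following the averaging formula above.
def ensemble_class_embeddings(class_names, templates):
    class_embs = []
    for c in class_names:
        tokens = clip.tokenize([t.format(c) for t in templates]).to(device)
        with torch.no_grad():
            emb = model.encode_text(tokens)      # (K, d): one embedding per template
        # Average the K template embeddings, then renormalize. (OpenAI's reference
        # notebook also L2-normalizes each template embedding before averaging;
        # both variants behave similarly in practice.)
        mean_emb = emb.mean(dim=0)
        class_embs.append(mean_emb / mean_emb.norm())
    return torch.stack(class_embs)               # (num_classes, d), unit rows

text_emb = ensemble_class_embeddings(class_names, templates)  # drop-in for `encode_classes`
```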

Step 4: CLIP Model Variant Comparison

We compare multiple CLIP model variants:

  • CLIP ViT-B/32 (151M params, fastest)
  • CLIP ViT-B/16 (151M params, finer 16x16 patches)
  • CLIP ViT-L/14 (427M params, best quality)
  • CLIP ViT-L/14@336px (427M params, highest input resolution)
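
A sketch of the sweep, assuming the openai/CLIP model identifiers and a hypothetical `evaluate_zero_shot` helper that wraps the evaluation loop from Step 2:

```python
# Sketch of the variant sweep. The identifiers below are the names the openai/CLIP
# package exposes via clip.available_models(); `evaluate_zero_shot` is a hypothetical
# helper wrapping the evaluation loop from Step 2.
for name in ["ViT-B/32", "ViT-B/16", "ViT-L/14", "ViT-L/14@336px"]:
    model, preprocess = clip.load(name, device=device)
    accuracy = evaluate_zero_shot(model, preprocess, class_names, test_set)
    print(f"{name}: {accuracy:.1%}")
```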

Step 5: Linear Probe Comparison

To understand the gap between zero-shot and supervised approaches, we also train a linear classifier on frozen CLIP features:

  1. Extract CLIP image features for all training images
  2. Train a logistic regression classifier (sklearn) on these features
  3. Evaluate on the test set

This represents the performance achievable with labeled data but without fine-tuning the CLIP backbone.
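
A sketch of this probe, assuming `train_samples` and `test_samples` are lists of (PIL image, label) pairs and reusing the model objects from the earlier sketches:

```python
# Sketch of the linear probe: frozen CLIP image features + scikit-learn logistic
# regression. Assumes `train_samples` and `test_samples` are lists of
# (PIL image, label) pairs; `model`, `preprocess`, `device` as in earlier sketches.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(samples, batch_size=256):
    feats, labels = [], []
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        images = torch.stack([preprocess(img) for img, _ in batch]).to(device)
        with torch.no_grad():
            f = model.encode_image(images)
        feats.append(f.float().cpu().numpy())
        labels.extend(lbl for _, lbl in batch)
    return np.concatenate(feats), np.array(labels)

X_train, y_train = extract_features(train_samples)
X_test, y_test = extract_features(test_samples)

# C=0.316 is a commonly used starting point for CLIP linear probes, not a tuned
# value; the CLIP paper sweeps C per dataset.
probe = LogisticRegression(C=0.316, max_iter=1000)
probe.fit(X_train, y_train)
print("linear probe accuracy:", probe.score(X_test, y_test))
```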

Results

CIFAR-10 Results

Method                                        Top-1 Accuracy
CLIP ViT-B/32 (bare class name)               83.7%
CLIP ViT-B/32 (single template)               89.2%
CLIP ViT-B/32 (80-template ensemble)          91.3%
CLIP ViT-L/14@336px (ensemble)                95.6%
Linear probe on CLIP ViT-L/14                 97.2%
Supervised ResNet-50 (trained on CIFAR-10)    95.4%

CIFAR-100 Results

Method                                        Top-1 Accuracy
CLIP ViT-B/32 (bare class name)               58.2%
CLIP ViT-B/32 (ensemble)                      65.1%
CLIP ViT-L/14@336px (ensemble)                75.3%
Linear probe on CLIP ViT-L/14                 82.7%

Flowers-102 Results

Method                                        Top-1 Accuracy
CLIP ViT-B/32 (bare class name)               61.5%
CLIP ViT-B/32 (domain template)               67.3%
CLIP ViT-B/32 (ensemble)                      69.8%
CLIP ViT-L/14@336px (ensemble)                76.1%
Linear probe on CLIP ViT-L/14                 97.5%

Key Observations

  1. Prompt engineering provides a 5-8 percentage point improvement over bare class names on CIFAR-10, demonstrating that prompt context matters significantly.

  2. Ensembling provides an additional 2-3 percentage points over the best single template, at the cost of $K \times$ more text encodings (computed once and amortized across all test images).

  3. Model scale matters enormously: ViT-L/14@336px achieves 95.6% on CIFAR-10 zero-shot, competitive with supervised ResNet-50.

  4. Fine-grained tasks show the largest gap: On Flowers-102, zero-shot achieves 76.1% vs. 97.5% for the linear probe, indicating that distinguishing similar species requires more than general visual-semantic alignment.

  5. Linear probing dramatically closes the gap: with labeled training data and only a linear classifier on frozen features, CLIP delivers excellent classification performance, suggesting the representations are rich and the zero-shot text interface is the bottleneck.

Error Analysis

Common failure patterns on CIFAR-100:

  • Visually similar classes: "maple tree" vs. "oak tree" — CLIP confuses fine-grained categories within the same superclass
  • Context-dependent classes: "lawn mower" confused with "tractor" — both are outdoor machines
  • Low-resolution challenges: 32x32 CIFAR images lose details that CLIP might otherwise use

Confusion matrix analysis reveals:

  • Most errors are within semantically related groups (vehicles, animals, plants)
  • CLIP almost never confuses semantically distant categories (e.g., "fish" vs. "rocket")
  • This suggests the embedding space captures semantic similarity well but lacks fine-grained discrimination

Key Lessons

  1. Zero-shot classification is remarkably effective for coarse categories (CIFAR-10 level), approaching supervised performance without any labeled data.

  2. Prompt engineering is not optional: for a fixed model, it is the largest factor in zero-shot performance that the practitioner controls. Investing time in prompt design yields significant returns.

  3. Domain-specific templates outperform generic ones for specialized datasets. Knowing the domain (flowers, medical images, satellite images) and encoding that knowledge in the prompt is essential.

  4. The zero-shot to linear-probe gap indicates representation quality: When the gap is small, the zero-shot prompt is well-aligned with the representation. When the gap is large, there is more information in the features than the zero-shot interface can access.

  5. CLIP model selection depends on the use case: ViT-B/32 is 4x faster than ViT-L/14 with only moderate accuracy loss, making it suitable for latency-sensitive applications.

Extensions

  • Implement few-shot classification using CLIP features with a k-NN classifier
  • Test on distribution-shifted versions of datasets (e.g., ImageNet-Sketch, ImageNet-R)
  • Build a dynamic prompt optimization system that learns optimal prompts for a given dataset
  • Combine zero-shot CLIP with active learning to efficiently label the most informative examples

Code Reference

The complete implementation is available in code/case-study-code.py.