Case Study 2: Zero-Shot Classification with CLIP
Overview
Traditional image classifiers require labeled training data for every class they must recognize. CLIP's aligned vision-language embedding space enables zero-shot classification — predicting classes the model has never been explicitly trained on — by comparing image embeddings to text embeddings of class descriptions. In this case study, you will build a zero-shot classification pipeline, systematically optimize it through prompt engineering and ensemble techniques, and evaluate it against supervised baselines across multiple datasets.
Problem Statement
Build a zero-shot image classification system using CLIP that:
1. Classifies images into arbitrary categories defined only by text descriptions
2. Achieves competitive accuracy with supervised models through prompt engineering
3. Generalizes across datasets without any fine-tuning or labeled training data
Datasets
We evaluate on three datasets with increasing difficulty:
- CIFAR-10: 10 broad classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). 10,000 test images at 32x32 resolution.
- CIFAR-100: 100 fine-grained classes organized into 20 superclasses. 10,000 test images.
- Oxford Flowers-102: 102 flower species. 6,149 test images. Tests fine-grained recognition.
Approach
Step 1: Baseline Zero-Shot Classification
The simplest approach uses the class names directly (a sketch follows the list):
- Create one text prompt per class from the bare class name
- Encode all prompts with CLIP's text encoder
- Encode each test image with CLIP's image encoder
- Predict the class whose text embedding has the highest cosine similarity with the image embedding
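A minimal sketch of this baseline using the openai/CLIP package is shown below; dataset loading and batching are simplified for clarity, and the case-study script may differ in its details:

```python
# Minimal baseline sketch (openai/CLIP package); batching and dataset handling
# are simplified compared to the full case-study script.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]  # CIFAR-10

# Encode one bare-class-name prompt per class (done once, reused for all images).
with torch.no_grad():
    text_tokens = clip.tokenize(class_names).to(device)
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

def classify(pil_image):
    """Return the index of the class with the highest cosine similarity."""
    image = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T  # cosine similarity (both sides normalized)
    return similarity.argmax(dim=-1).item()
```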
Step 2: Single-Template Prompt Engineering
We test various prompt templates and measure their impact:
"{class}"(bare class name)"a photo of a {class}""a photo of a {class}, a type of object""a centered photo of a {class}""a good photo of a {class}"
For domain-specific datasets (Flowers-102):
- "a photo of a {class}, a type of flower"
- "a close-up photo of a {class} flower"
Step 3: Prompt Ensemble
Following CLIP's original paper, we ensemble multiple prompt templates:
- Define $K$ diverse templates for each class
- Encode each template and average the embeddings per class: $\bar{\mathbf{u}}_c = \frac{1}{K}\sum_{k=1}^{K} f_{\text{text}}(\text{template}_k(c))$
- Normalize the averaged embedding: $\hat{\mathbf{u}}_c = \bar{\mathbf{u}}_c / \|\bar{\mathbf{u}}_c\|$
- Use the ensembled embeddings for classification
We use 80 templates from OpenAI's CLIP prompt collection, covering diverse contexts ("a painting of a {class}", "a cartoon {class}", "a blurry photo of a {class}", etc.).
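A sketch of the ensembling step, implementing the averaging formula above; it reuses `model`, `class_names`, and `device` from the baseline sketch, `build_ensemble_text_features` is our own name, and the template list shown is a small illustrative subset of the 80-template collection:

```python
# Prompt-ensemble sketch: average the per-template text embeddings for each class,
# then re-normalize, following the formula above. (CLIP's reference implementation
# also normalizes each template embedding before averaging; the difference is minor.)
import torch
import clip

def build_ensemble_text_features(model, class_names, templates, device="cpu"):
    class_features = []
    with torch.no_grad():
        for name in class_names:
            prompts = [t.format(name) for t in templates]                      # K prompts for class c
            embeddings = model.encode_text(clip.tokenize(prompts).to(device))  # (K, d)
            mean_embedding = embeddings.mean(dim=0)                            # averaged embedding
            class_features.append(mean_embedding / mean_embedding.norm())      # normalized
    return torch.stack(class_features)                                         # (num_classes, d)

# Illustrative subset of the 80-template collection.
templates = [
    "a photo of a {}.",
    "a painting of a {}.",
    "a cartoon {}.",
    "a blurry photo of a {}.",
]
text_features = build_ensemble_text_features(model, class_names, templates, device)
```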
Step 4: CLIP Model Variant Comparison
We compare multiple CLIP model variants (see the loading sketch after this list):
- CLIP ViT-B/32 (151M params, fastest)
- CLIP ViT-B/16 (151M params, finer 16x16 patches)
- CLIP ViT-L/14 (427M params, best quality)
- CLIP ViT-L/14@336px (427M params, highest input resolution)
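A brief sketch of iterating over the variants, using the names under which they are registered in the openai/CLIP package; the evaluation call is left as a placeholder:

```python
# Compare CLIP variants; the names match those registered in the openai/CLIP package.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"

for name in ["ViT-B/32", "ViT-B/16", "ViT-L/14", "ViT-L/14@336px"]:
    model, preprocess = clip.load(name, device=device)  # downloads weights on first use
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {n_params:.0f}M parameters")
    # evaluate_zero_shot(model, preprocess)  # placeholder: rerun the zero-shot evaluation
```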
Step 5: Linear Probe Comparison
To understand the gap between zero-shot and supervised approaches, we also train a linear classifier on frozen CLIP features:
- Extract CLIP image features for all training images
- Train a logistic regression classifier (sklearn) on these features
- Evaluate on the test set
This represents the performance achievable with labeled data but without fine-tuning the CLIP backbone.
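A sketch of the linear probe along these lines; `train_loader` and `test_loader` are assumed PyTorch DataLoaders built with CLIP's `preprocess` transform, and the regularization strength `C` would normally be tuned on a validation split:

```python
# Linear probe on frozen CLIP features; `train_loader` and `test_loader` are
# assumed DataLoaders that apply CLIP's `preprocess` transform.
import numpy as np
import torch
import clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def extract_features(loader):
    """Encode all images with the frozen CLIP image encoder."""
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            feats.append(model.encode_image(images.to(device)).cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

train_x, train_y = extract_features(train_loader)
test_x, test_y = extract_features(test_loader)

clf = LogisticRegression(C=0.316, max_iter=1000)  # C should be swept on a validation split
clf.fit(train_x, train_y)
print("Linear probe accuracy:", clf.score(test_x, test_y))
```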
Results
CIFAR-10 Results
| Method | Top-1 Accuracy |
|---|---|
| CLIP ViT-B/32 (bare class name) | 83.7% |
| CLIP ViT-B/32 (single template) | 89.2% |
| CLIP ViT-B/32 (80-template ensemble) | 91.3% |
| CLIP ViT-L/14@336px (ensemble) | 95.6% |
| Linear probe on CLIP ViT-L/14 | 97.2% |
| Supervised ResNet-50 (trained on CIFAR-10) | 95.4% |
CIFAR-100 Results
| Method | Top-1 Accuracy |
|---|---|
| CLIP ViT-B/32 (bare class name) | 58.2% |
| CLIP ViT-B/32 (ensemble) | 65.1% |
| CLIP ViT-L/14@336px (ensemble) | 75.3% |
| Linear probe on CLIP ViT-L/14 | 82.7% |
Flowers-102 Results
| Method | Top-1 Accuracy |
|---|---|
| CLIP ViT-B/32 (bare class name) | 61.5% |
| CLIP ViT-B/32 (domain template) | 67.3% |
| CLIP ViT-B/32 (ensemble) | 69.8% |
| CLIP ViT-L/14@336px (ensemble) | 76.1% |
| Linear probe on CLIP ViT-L/14 | 97.5% |
Key Observations
- Prompt engineering provides a 5-8 percentage point improvement over bare class names on CIFAR-10, demonstrating that context matters significantly.
- Ensembling provides an additional 2-3 point improvement over the best single template, at the cost of $K \times$ more text encodings (computed once and amortized over all images).
- Model scale matters enormously: ViT-L/14@336px achieves 95.6% on CIFAR-10 zero-shot, competitive with a supervised ResNet-50.
- Fine-grained tasks show the largest gap: on Flowers-102, zero-shot achieves 76.1% vs. 97.5% for the linear probe, indicating that distinguishing similar species requires more than general visual-semantic alignment.
- Linear probing dramatically closes the gap: with labeled data and only a linear classifier, CLIP features deliver excellent accuracy, suggesting the representations are rich and the zero-shot text interface is the bottleneck.
Error Analysis
Common failure patterns on CIFAR-100:
- Visually similar classes: "maple tree" vs. "oak tree" — CLIP confuses fine-grained categories within the same superclass
- Context-dependent classes: "lawn mower" confused with "tractor" — both are outdoor machines
- Low-resolution challenges: 32x32 CIFAR images lose details that CLIP might otherwise use
Confusion matrix analysis reveals:
- Most errors are within semantically related groups (vehicles, animals, plants)
- CLIP almost never confuses semantically distant categories (e.g., "fish" vs. "rocket")
- This suggests the embedding space captures semantic similarity well but lacks fine-grained discrimination
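This analysis can be reproduced with a few lines once predictions are collected; `y_true`, `y_pred`, and `class_names` below are assumed to come from the evaluation loop:

```python
# Rank the most frequent confusions; y_true, y_pred, and class_names are assumed
# to be collected during the zero-shot evaluation loop.
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)       # (num_classes, num_classes)
np.fill_diagonal(cm, 0)                     # keep only off-diagonal errors
top = np.argsort(cm, axis=None)[::-1][:10]  # ten largest error counts
for flat_idx in top:
    i, j = np.unravel_index(flat_idx, cm.shape)
    print(f"{class_names[i]} -> {class_names[j]}: {cm[i, j]} errors")
```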
Key Lessons
- Zero-shot classification is remarkably effective for coarse categories (CIFAR-10 level), approaching supervised performance without any labeled data.
- Prompt engineering is not optional: it is the single largest factor in zero-shot performance in this study. Investing time in prompt design yields significant returns.
- Domain-specific templates outperform generic ones for specialized datasets. Knowing the domain (flowers, medical images, satellite images) and encoding that knowledge in the prompt is essential.
- The zero-shot to linear-probe gap indicates how well the text interface exploits the representation: when the gap is small, the zero-shot prompts are well aligned with the features; when the gap is large, the features contain more information than the zero-shot interface can access.
- CLIP model selection depends on the use case: ViT-B/32 is roughly 4x faster than ViT-L/14 with only moderate accuracy loss, making it suitable for latency-sensitive applications.
Extensions
- Implement few-shot classification using CLIP features with a k-NN classifier
- Test on distribution-shifted versions of datasets (e.g., ImageNet-Sketch, ImageNet-R)
- Build a dynamic prompt optimization system that learns optimal prompts for a given dataset
- Combine zero-shot CLIP with active learning to efficiently label the most informative examples
Code Reference
The complete implementation is available in code/case-study-code.py.