Case Study 2: Zero-Shot Classification with CLIP
Overview
Traditional image classifiers require labeled training data for every class they must recognize. CLIP's aligned vision-language embedding space enables zero-shot classification — predicting classes the model has never been explicitly trained on — by comparing image embeddings to text embeddings of class descriptions. In this case study, you will build a zero-shot classification pipeline, systematically optimize it through prompt engineering and ensemble techniques, and evaluate it against supervised baselines across multiple datasets.
Problem Statement
Build a zero-shot image classification system using CLIP that:
1. Classifies images into arbitrary categories defined only by text descriptions
2. Achieves competitive accuracy with supervised models through prompt engineering
3. Generalizes across datasets without any fine-tuning or labeled training data
Datasets
We evaluate on three datasets with increasing difficulty:
- CIFAR-10: 10 broad classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). 10,000 test images at 32x32 resolution.
- CIFAR-100: 100 fine-grained classes organized into 20 superclasses. 10,000 test images.
- Oxford Flowers-102: 102 flower species. 6,149 test images. Tests fine-grained recognition.
Approach
Step 1: Baseline Zero-Shot Classification
The simplest approach uses the class names directly (a sketch follows the list):
- Create one text prompt per class from the bare class name
- Encode all prompts with CLIP's text encoder
- Encode each test image with CLIP's image encoder
- Predict the class whose text embedding has the highest cosine similarity with the image embedding
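A minimal sketch of this baseline using the openai/CLIP package is shown below; dataset loading and batching are simplified for clarity, and the case-study script may differ in its details:

```python
# Minimal baseline sketch (openai/CLIP package); batching and dataset handling
# are simplified compared to the full case-study script.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]  # CIFAR-10

# Encode one bare-class-name prompt per class (done once, reused for all images).
with torch.no_grad():
    text_tokens = clip.tokenize(class_names).to(device)
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

def classify(pil_image):
    """Return the index of the class with the highest cosine similarity."""
    image = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T  # cosine similarity (both sides normalized)
    return similarity.argmax(dim=-1).item()
```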
Step 2: Single-Template Prompt Engineering
We test various prompt templates and measure their impact:
"{class}"(bare class name)"a photo of a {class}""a photo of a {class}, a type of object""a centered photo of a {class}""a good photo of a {class}"
For domain-specific datasets (Flowers-102):
- "a photo of a {class}, a type of flower"
- "a close-up photo of a {class} flower"
Step 3: Prompt Ensemble
Following CLIP's original paper, we ensemble multiple prompt templates:
- Define $K$ diverse templates for each class
- Encode each template and average the embeddings per class: $\bar{\mathbf{u}}_c = \frac{1}{K}\sum_{k=1}^{K} f_{\text{text}}(\text{template}_k(c))$
- Normalize the averaged embedding: $\hat{\mathbf{u}}_c = \bar{\mathbf{u}}_c / \|\bar{\mathbf{u}}_c\|$
- Use the ensembled embeddings for classification
We use 80 templates from OpenAI's CLIP prompt collection, covering diverse contexts ("a painting of a {class}", "a cartoon {class}", "a blurry photo of a {class}", etc.).
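A sketch of the ensembling step, implementing the averaging formula above; it reuses `model`, `class_names`, and `device` from the baseline sketch, `build_ensemble_text_features` is our own name, and the template list shown is a small illustrative subset of the 80-template collection:

```python
# Prompt-ensemble sketch: average the per-template text embeddings for each class,
# then re-normalize, following the formula above. (CLIP's reference implementation
# also normalizes each template embedding before averaging; the difference is minor.)
import torch
import clip

def build_ensemble_text_features(model, class_names, templates, device="cpu"):
    class_features = []
    with torch.no_grad():
        for name in class_names:
            prompts = [t.format(name) for t in templates]                      # K prompts for class c
            embeddings = model.encode_text(clip.tokenize(prompts).to(device))  # (K, d)
            mean_embedding = embeddings.mean(dim=0)                            # averaged embedding
            class_features.append(mean_embedding / mean_embedding.norm())      # normalized
    return torch.stack(class_features)                                         # (num_classes, d)

# Illustrative subset of the 80-template collection.
templates = [
    "a photo of a {}.",
    "a painting of a {}.",
    "a cartoon {}.",
    "a blurry photo of a {}.",
]
text_features = build_ensemble_text_features(model, class_names, templates, device)
```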
Step 4: CLIP Model Variant Comparison
We compare multiple CLIP model variants (see the loading sketch after this list):
- CLIP ViT-B/32 (151M params, fastest)
- CLIP ViT-B/16 (151M params, finer 16x16 patches)
- CLIP ViT-L/14 (427M params, best quality)
- CLIP ViT-L/14@336px (427M params, highest input resolution)
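A brief sketch of iterating over the variants, using the names under which they are registered in the openai/CLIP package; the evaluation call is left as a placeholder:

```python
# Compare CLIP variants; the names match those registered in the openai/CLIP package.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"

for name in ["ViT-B/32", "ViT-B/16", "ViT-L/14", "ViT-L/14@336px"]:
    model, preprocess = clip.load(name, device=device)  # downloads weights on first use
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {n_params:.0f}M parameters")
    # evaluate_zero_shot(model, preprocess)  # placeholder: rerun the zero-shot evaluation
```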
Step 5: Linear Probe Comparison
To understand the gap between zero-shot and supervised approaches, we also train a linear classifier on frozen CLIP features:
- Extract CLIP image features for all training images
- Train a logistic regression classifier (sklearn) on these features
- Evaluate on the test set
This represents the performance achievable with labeled data but without fine-tuning the CLIP backbone.
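A sketch of the linear probe along these lines; `train_loader` and `test_loader` are assumed PyTorch DataLoaders built with CLIP's `preprocess` transform, and the regularization strength `C` would normally be tuned on a validation split:

```python
# Linear probe on frozen CLIP features; `train_loader` and `test_loader` are
# assumed DataLoaders that apply CLIP's `preprocess` transform.
import numpy as np
import torch
import clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def extract_features(loader):
    """Encode all images with the frozen CLIP image encoder."""
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            feats.append(model.encode_image(images.to(device)).cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

train_x, train_y = extract_features(train_loader)
test_x, test_y = extract_features(test_loader)

clf = LogisticRegression(C=0.316, max_iter=1000)  # C should be swept on a validation split
clf.fit(train_x, train_y)
print("Linear probe accuracy:", clf.score(test_x, test_y))
```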
Results
CIFAR-10 Results
| Method | Top-1 Accuracy |
|---|---|
| CLIP ViT-B/32 (bare class name) | 83.7% |
| CLIP ViT-B/32 (single template) | 89.2% |
| CLIP ViT-B/32 (80-template ensemble) | 91.3% |
| CLIP ViT-L/14@336px (ensemble) | 95.6% |
| Linear probe on CLIP ViT-L/14 | 97.2% |
| Supervised ResNet-50 (trained on CIFAR-10) | 95.4% |
CIFAR-100 Results
| Method | Top-1 Accuracy |
|---|---|
| CLIP ViT-B/32 (bare class name) | 58.2% |
| CLIP ViT-B/32 (ensemble) | 65.1% |
| CLIP ViT-L/14@336px (ensemble) | 75.3% |
| Linear probe on CLIP ViT-L/14 | 82.7% |
Flowers-102 Results
| Method | Top-1 Accuracy |
|---|---|
| CLIP ViT-B/32 (bare class name) | 61.5% |
| CLIP ViT-B/32 (domain template) | 67.3% |
| CLIP ViT-B/32 (ensemble) | 69.8% |
| CLIP ViT-L/14@336px (ensemble) | 76.1% |
| Linear probe on CLIP ViT-L/14 | 97.5% |
Key Observations
- Prompt engineering provides a 5-8 percentage point improvement over bare class names on CIFAR-10, demonstrating that context matters significantly.
- Ensembling provides an additional 2-3 point improvement over the best single template, at the cost of $K \times$ more text encodings (computed once and amortized over all images).
- Model scale matters enormously: ViT-L/14@336px achieves 95.6% on CIFAR-10 zero-shot, competitive with a supervised ResNet-50.
- Fine-grained tasks show the largest gap: on Flowers-102, zero-shot achieves 76.1% vs. 97.5% for the linear probe, indicating that distinguishing similar species requires more than general visual-semantic alignment.
- Linear probing dramatically closes the gap: with labeled data and only a linear classifier, CLIP features deliver excellent accuracy, suggesting the representations are rich and the zero-shot text interface is the bottleneck.
Error Analysis
Common failure patterns on CIFAR-100:
- Visually similar classes: "maple tree" vs. "oak tree" — CLIP confuses fine-grained categories within the same superclass
- Context-dependent classes: "lawn mower" confused with "tractor" — both are outdoor machines
- Low-resolution challenges: 32x32 CIFAR images lose details that CLIP might otherwise use
Confusion matrix analysis reveals:
- Most errors are within semantically related groups (vehicles, animals, plants)
- CLIP almost never confuses semantically distant categories (e.g., "fish" vs. "rocket")
- This suggests the embedding space captures semantic similarity well but lacks fine-grained discrimination
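This analysis can be reproduced with a few lines once predictions are collected; `y_true`, `y_pred`, and `class_names` below are assumed to come from the evaluation loop:

```python
# Rank the most frequent confusions; y_true, y_pred, and class_names are assumed
# to be collected during the zero-shot evaluation loop.
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)       # (num_classes, num_classes)
np.fill_diagonal(cm, 0)                     # keep only off-diagonal errors
top = np.argsort(cm, axis=None)[::-1][:10]  # ten largest error counts
for flat_idx in top:
    i, j = np.unravel_index(flat_idx, cm.shape)
    print(f"{class_names[i]} -> {class_names[j]}: {cm[i, j]} errors")
```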
Key Lessons
- Zero-shot classification is remarkably effective for coarse categories (CIFAR-10 level), approaching supervised performance without any labeled data.
- Prompt engineering is not optional: it is the single largest factor in zero-shot performance in this study. Investing time in prompt design yields significant returns.
- Domain-specific templates outperform generic ones for specialized datasets. Knowing the domain (flowers, medical images, satellite images) and encoding that knowledge in the prompt is essential.
- The zero-shot to linear-probe gap indicates how well the text interface exploits the representation: when the gap is small, the zero-shot prompts are well aligned with the features; when the gap is large, the features contain more information than the zero-shot interface can access.
- CLIP model selection depends on the use case: ViT-B/32 is roughly 4x faster than ViT-L/14 with only moderate accuracy loss, making it suitable for latency-sensitive applications.
Extensions
- Implement few-shot classification using CLIP features with a k-NN classifier
- Test on distribution-shifted versions of datasets (e.g., ImageNet-Sketch, ImageNet-R)
- Build a dynamic prompt optimization system that learns optimal prompts for a given dataset
- Combine zero-shot CLIP with active learning to efficiently label the most informative examples
Code Reference
The complete implementation is available in code/case-study-code.py.