Case Study 1: Video Classification with Video Transformers
Overview
Video classification is the most fundamental video understanding task: given a video clip, predict the action or activity being performed. In this case study, you will build a complete video classification pipeline using a pre-trained Video Swin Transformer, fine-tune it on the UCF-101 dataset, and implement the standard multi-clip evaluation protocol. You will compare the transformer-based approach with simpler baselines and analyze when temporal modeling makes the biggest difference.
Problem Statement
Build a video classification system for UCF-101 (101 human action classes) that:

1. Loads and preprocesses video clips efficiently
2. Fine-tunes a pre-trained Video Swin Transformer
3. Evaluates using multi-clip, multi-crop inference
4. Analyzes which action categories benefit most from temporal modeling
Dataset
UCF-101 contains 13,320 video clips across 101 action classes, including:

- Sports (basketball, cricket, diving, etc.)
- Playing instruments (guitar, violin, drums, etc.)
- Human activities (brushing teeth, cooking, typing, etc.)
- Human-object interactions (applying lipstick, using a phone, etc.)
Videos are variable length (2-12 seconds) at 320x240 resolution and 25 FPS. We use the standard three-split evaluation protocol.
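The official splits ship as plain-text annotation files (classInd.txt, trainlist01.txt, testlist01.txt). Below is a minimal sketch of turning split 1 into (video_path, label) pairs; the directory layout (a UCF-101/ folder of per-class subfolders next to ucfTrainTestlist/) is an assumption about how the dataset was unpacked.

```python
from pathlib import Path

def load_ucf101_split(root, list_file, class_index_file="classInd.txt"):
    """Parse UCF-101 annotation files into (video_path, label) pairs.

    Assumes (not guaranteed by the dataset download) that <root>/ucfTrainTestlist/
    holds the split files and <root>/UCF-101/ holds videos grouped by class folder.
    """
    ann_dir = Path(root) / "ucfTrainTestlist"

    # classInd.txt lines look like "1 ApplyEyeMakeup" (labels are 1-indexed).
    class_to_idx = {}
    for line in (ann_dir / class_index_file).read_text().splitlines():
        if not line.strip():
            continue
        idx, name = line.strip().split()
        class_to_idx[name] = int(idx) - 1  # convert to 0-indexed

    samples = []
    for line in (ann_dir / list_file).read_text().splitlines():
        if not line.strip():
            continue
        # Train lists: "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1"
        # Test lists omit the trailing label, so derive it from the class folder.
        rel_path = line.strip().split()[0]
        class_name = rel_path.split("/")[0]
        samples.append((Path(root) / "UCF-101" / rel_path, class_to_idx[class_name]))
    return samples

# Split 1 of the three-split protocol (paths are illustrative).
train_samples = load_ucf101_split("/data/ucf101", "trainlist01.txt")
test_samples = load_ucf101_split("/data/ucf101", "testlist01.txt")
```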
Approach
Step 1: Data Loading Pipeline
Building an efficient video data loader is critical:
- Video decoding: Use `torchvision.io.read_video` or the `decord` library (3-5x faster than OpenCV) to decode videos.
- Temporal sampling: During training, use dense sampling (a random 16-frame clip). During evaluation, use uniform sampling (10 clips per video).
- Spatial preprocessing: Resize the shorter side to 256 pixels, then random crop to 224x224 during training and center crop during evaluation.
- Augmentation: Random horizontal flip, color jitter (consistent across frames), and RandAugment adapted for video.
- Normalization: ImageNet mean and standard deviation applied to each frame.
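A minimal sketch of the clip sampling and normalization described above, assuming the `decord` library is installed; spatial resizing, cropping, flipping, and color jitter are left to standard torchvision transforms and omitted here for brevity.

```python
import numpy as np
import torch
from decord import VideoReader, cpu

# ImageNet statistics, applied per frame as described above.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1, 1)

def sample_clip(video_path, num_frames=16, train=True):
    """Decode one clip: a random contiguous 16-frame window for training,
    frames spread uniformly over the whole video for evaluation."""
    vr = VideoReader(str(video_path), ctx=cpu(0))
    total = len(vr)
    if train:
        start = np.random.randint(0, max(1, total - num_frames))
        indices = np.arange(start, start + num_frames) % total  # wrap short videos
    else:
        indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()              # (T, H, W, 3), uint8
    clip = torch.from_numpy(frames).permute(3, 0, 1, 2)   # (3, T, H, W)
    return clip.float() / 255.0

def normalize(clip):
    """Apply per-frame ImageNet normalization to a (3, T, H, W) clip."""
    return (clip - IMAGENET_MEAN) / IMAGENET_STD
```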
Step 2: Model Architecture
We use Video Swin-Tiny (pre-trained on Kinetics-400):

- Patch size: 2x4x4 (temporal x height x width)
- Window size: 8x7x7
- Embed dimension: 96
- Depths: [2, 2, 6, 2] for the four stages
- Approximately 28M parameters
We replace the classification head (400 classes for Kinetics-400) with a new linear layer (768 -> 101 for UCF-101).
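One way to obtain such a model is torchvision's `swin3d_t` with Kinetics-400 weights (torchvision >= 0.15); the official Video Swin repository would work equally well, so treat the exact API below as an assumption. The sketch also folds in the 0.1 head dropout from the training configuration below.

```python
import torch.nn as nn
from torchvision.models.video import swin3d_t, Swin3D_T_Weights

# Load a Video Swin-Tiny backbone pre-trained on Kinetics-400.
model = swin3d_t(weights=Swin3D_T_Weights.KINETICS400_V1)

# Swap the 400-way Kinetics head for a fresh 101-way UCF-101 head.
# The final stage outputs 8 * 96 = 768 features.
model.head = nn.Sequential(
    nn.Dropout(p=0.1),                      # 0.1 dropout in the classification head
    nn.Linear(in_features=768, out_features=101),
)
# Initialization scheme (std=0.02 truncated normal) is an assumption.
nn.init.trunc_normal_(model.head[1].weight, std=0.02)
nn.init.zeros_(model.head[1].bias)
```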
Step 3: Training Configuration
- Optimizer: AdamW with learning rate $3 \times 10^{-4}$ for the head and $3 \times 10^{-5}$ for the backbone (10x lower)
- Scheduler: Cosine annealing with 5-epoch warmup
- Epochs: 30
- Batch size: 8 per GPU (with gradient accumulation of 4 for effective batch size 32)
- Mixed precision: FP16 with gradient scaling
- Label smoothing: 0.1
- Dropout: 0.1 in the classification head
- Gradient clipping: Max norm 1.0
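A minimal sketch of this recipe, assuming the `model` from the previous step and a `train_loader` yielding (clips, labels) batches; the `SequentialLR` composition is one way to implement the 5-epoch warmup, and only the optimization step of the training loop is shown.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

EPOCHS, WARMUP_EPOCHS, ACCUM_STEPS = 30, 5, 4

# Two parameter groups: full LR for the new head, 10x lower for the backbone.
head_params = [p for n, p in model.named_parameters() if n.startswith("head")]
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("head")]
optimizer = AdamW([
    {"params": head_params, "lr": 3e-4},
    {"params": backbone_params, "lr": 3e-5},
])

steps_per_epoch = len(train_loader) // ACCUM_STEPS
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=WARMUP_EPOCHS * steps_per_epoch),
        CosineAnnealingLR(optimizer, T_max=(EPOCHS - WARMUP_EPOCHS) * steps_per_epoch),
    ],
    milestones=[WARMUP_EPOCHS * steps_per_epoch],
)

criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
scaler = torch.cuda.amp.GradScaler()  # FP16 training with gradient scaling

for step, (clips, labels) in enumerate(train_loader):
    with torch.cuda.amp.autocast():
        loss = criterion(model(clips.cuda()), labels.cuda()) / ACCUM_STEPS
    scaler.scale(loss).backward()
    # Accumulate 4 micro-batches of 8 for an effective batch size of 32.
    if (step + 1) % ACCUM_STEPS == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        scheduler.step()
```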
Step 4: Multi-Clip Evaluation
During inference:

1. Sample 10 clips uniformly from the video
2. For each clip, apply 3 spatial crops (left, center, right of a 256-pixel shorter side)
3. Forward pass all 30 views through the model
4. Average the softmax probabilities
5. Predict the class with the highest average probability
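A minimal sketch of the view-averaging step, assuming the 30 views (10 clips x 3 crops) have already been sampled and preprocessed into tensors:

```python
import torch

@torch.no_grad()
def predict_video(model, views):
    """Average softmax scores over all clip/crop views of one video.

    `views` is a list of 30 tensors of shape (3, 16, 224, 224):
    10 uniformly spaced clips x 3 spatial crops each.
    """
    model.eval()
    batch = torch.stack(views).cuda()            # (30, 3, 16, 224, 224); chunk if memory-bound
    probs = torch.softmax(model(batch), dim=-1)  # (30, 101)
    avg_probs = probs.mean(dim=0)                # average over all views
    return avg_probs.argmax().item()
```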
Step 5: Baselines for Comparison
We compare against:

1. Frame-level baseline: Average per-frame ResNet-50 features + linear classifier
2. CLIP frame averaging: Average per-frame CLIP ViT-B/16 features + linear classifier
3. CLIP + temporal transformer: Per-frame CLIP features processed by a 2-layer temporal transformer
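As an illustration of the second baseline, here is a sketch of CLIP frame averaging using the openai/CLIP package; the choice of package, and training the linear head with standard cross-entropy on the averaged features, are assumptions.

```python
import clip          # openai/CLIP package (one possible choice)
import torch
import torch.nn as nn

device = "cuda"
clip_model, preprocess = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def video_clip_feature(frames):
    """Average CLIP image embeddings over uniformly sampled frames.

    `frames` is a list of PIL images sampled from one video.
    """
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    feats = clip_model.encode_image(batch).float()       # (T, 512) for ViT-B/16
    feats = feats / feats.norm(dim=-1, keepdim=True)     # L2-normalize per frame
    return feats.mean(dim=0)                             # (512,) video-level feature

# The baseline classifier is a single linear layer on the averaged feature.
linear_head = nn.Linear(512, 101).to(device)
```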
Results
Overall Accuracy (Split 1)
| Method | Top-1 Accuracy | Top-5 Accuracy | Inference Time (per video) |
|---|---|---|---|
| Frame ResNet-50 (avg) | 81.2% | 95.1% | 0.3s |
| CLIP frame avg + linear | 87.4% | 97.2% | 0.5s |
| CLIP + temporal transformer | 89.1% | 97.8% | 0.8s |
| Video Swin-T (1 clip) | 91.3% | 98.4% | 0.4s |
| Video Swin-T (10 clips x 3 crops) | 93.8% | 99.1% | 12s |
Per-Category Analysis
Categories where temporal modeling helps most (>10% improvement over frame-level):

- Playing guitar vs. playing ukulele: Requires seeing the strumming motion pattern
- Jumping jack vs. body weight squats: Same body pose in some frames
- Writing on board vs. applying eye makeup: Arm motion patterns differ
Categories where temporal modeling helps least (<2% improvement):

- Pizza tossing: Distinctive visual features in any frame
- Rock climbing: Scene context is sufficient
- Ice dancing: Unique costumes and setting
Learning Curve
Training loss converges by epoch 15, but validation accuracy continues improving until epoch 25, suggesting regularization is effective. The learning rate warmup is essential — without it, training diverges in the first epoch due to the large learning rate applied to the pre-trained backbone.
Key Lessons
- Pre-training dominates: The Video Swin Transformer pre-trained on Kinetics-400 provides a massive head start. Training from scratch on UCF-101 alone achieves only ~75% accuracy.
- Multi-clip evaluation provides 2-3% gains: The improvement is consistent across categories and is essentially free (just more inference time). Always use it for final evaluation.
- CLIP features are surprisingly strong: Simply averaging CLIP ViT-B/16 frame features achieves 87.4%, competitive with older video-specific architectures, demonstrating the power of large-scale image pre-training.
- Temporal modeling matters for motion-dependent tasks: The biggest improvements from video transformers come on categories where temporal order is essential. For appearance-dominated categories, frame-level features suffice.
- Efficient data loading is the bottleneck: Without optimized video decoding (using decord or NVDEC), data loading dominates training time. Prefetching and multi-worker loading are essential.
- Lower backbone learning rate prevents catastrophic forgetting: Using a 10x lower learning rate for the pre-trained backbone preserves spatial representations while allowing the head and temporal modeling to adapt.
Extensions
- Evaluate on Something-Something v2 to test temporal reasoning specifically
- Implement VideoMAE pre-training on UCF-101 and compare with Kinetics-400 pre-training
- Add temporal action detection capabilities by extending the model to handle untrimmed videos
- Implement knowledge distillation from Video Swin-B to Video Swin-T
Code Reference
The complete implementation is available in code/case-study-code.py.