Case Study 1: Video Classification with Video Transformers

Overview

Video classification is the most fundamental video understanding task: given a video clip, predict the action or activity being performed. In this case study, you will build a complete video classification pipeline using a pre-trained Video Swin Transformer, fine-tune it on the UCF-101 dataset, and implement the standard multi-clip evaluation protocol. You will compare the transformer-based approach with simpler baselines and analyze when temporal modeling makes the biggest difference.

Problem Statement

Build a video classification system for UCF-101 (101 human action classes) that:

  1. Loads and preprocesses video clips efficiently
  2. Fine-tunes a pre-trained Video Swin Transformer
  3. Evaluates using multi-clip, multi-crop inference
  4. Analyzes which action categories benefit most from temporal modeling

Dataset

UCF-101 contains 13,320 video clips across 101 action classes, including:

  • Sports (basketball, cricket, diving, etc.)
  • Playing instruments (guitar, violin, drums, etc.)
  • Human activities (brushing teeth, cooking, typing, etc.)
  • Human-object interactions (applying lipstick, using a phone, etc.)

Videos are variable length (2-12 seconds) at 320x240 resolution and 25 FPS. We use the standard three-split evaluation protocol.
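
The official splits are distributed as a small annotation folder (commonly named ucfTrainTestlist). A hedged parsing sketch is below; the file-format details reflect the standard UCF-101 annotation release but should be checked against your download, and load_split is our helper name, not part of the reference code.

```python
# Hedged sketch: parse UCF-101 split files into (video path, label) pairs.
# Assumes the standard ucfTrainTestlist layout (classInd.txt, trainlist0X.txt,
# testlist0X.txt); verify against your own download.
from pathlib import Path

def load_split(anno_dir, split=1):
    """Return (train, test) lists of (relative_video_path, 0-based class index)."""
    anno = Path(anno_dir)

    # classInd.txt: "<1-based index> <ClassName>" per line
    class_to_idx = {}
    for line in (anno / "classInd.txt").read_text().splitlines():
        idx, name = line.split()
        class_to_idx[name] = int(idx) - 1

    def parse(list_file):
        pairs = []
        for line in (anno / list_file).read_text().splitlines():
            if not line.strip():
                continue
            path = line.split()[0]                  # test lists carry no label column
            pairs.append((path, class_to_idx[path.split("/")[0]]))
        return pairs

    # trainlist0X.txt: "<ClassName>/<video>.avi <label>"; testlist0X.txt: path only
    return parse(f"trainlist{split:02d}.txt"), parse(f"testlist{split:02d}.txt")
```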

Approach

Step 1: Data Loading Pipeline

Building an efficient video data loader is critical (a minimal sketch follows this list):

  1. Video decoding: Use torchvision.io.read_video or the decord library (3-5x faster than OpenCV) to decode videos.
  2. Temporal sampling: During training, use dense sampling (random 16-frame clip). During evaluation, use uniform sampling (10 clips per video).
  3. Spatial preprocessing: Resize the shorter side to 256 pixels, then random crop to 224x224 during training and center crop during evaluation.
  4. Augmentation: Random horizontal flip, color jitter (consistent across frames), and RandAugment adapted for video.
  5. Normalization: ImageNet mean and standard deviation applied to each frame.
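
The sketch below illustrates the training-time loader described above, assuming the decord library is installed; the helper names and the temporal stride of 2 are our choices for illustration, not part of the reference implementation. Color jitter and RandAugment are omitted for brevity.

```python
# Sketch of dense temporal sampling + spatial preprocessing for training.
import random

import numpy as np
import torch
from decord import VideoReader, cpu
from torchvision import transforms

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def sample_train_clip(path, num_frames=16, stride=2):
    """Decode a random contiguous clip of num_frames frames (dense sampling)."""
    vr = VideoReader(path, ctx=cpu(0))
    span = num_frames * stride
    start = random.randint(0, max(len(vr) - span, 0))
    indices = np.clip(np.arange(start, start + span, stride), 0, len(vr) - 1)
    frames = vr.get_batch(indices).asnumpy()               # (T, H, W, C), uint8
    return torch.from_numpy(frames).permute(0, 3, 1, 2)    # (T, C, H, W)

# Shorter side to 256, random 224x224 crop, horizontal flip, ImageNet normalization.
# Because the whole (T, C, H, W) tensor passes through each transform at once,
# the crop and flip are applied consistently across all frames of the clip.
train_transform = transforms.Compose([
    transforms.ConvertImageDtype(torch.float32),
    transforms.Resize(256, antialias=True),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

def load_train_clip(path):
    clip = train_transform(sample_train_clip(path))        # (T, C, 224, 224)
    return clip.permute(1, 0, 2, 3)                        # (C, T, H, W) for the model
```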

Step 2: Model Architecture

We use Video Swin-Tiny (pre-trained on Kinetics-400):

  • Patch size: 2x4x4 (temporal x height x width)
  • Window size: 8x7x7
  • Embed dimension: 96
  • Depths: [2, 2, 6, 2] for the four stages
  • Approximately 28M parameters

We replace the classification head (400 classes for Kinetics-400) with a new linear layer (768 -> 101 for UCF-101).
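
One way to instantiate this in code is with torchvision's Swin3D-T, which mirrors the Video Swin-T configuration and ships Kinetics-400 weights; the original Video Swin checkpoints can instead be loaded through the authors' codebase. A sketch:

```python
# Hedged sketch: load a Kinetics-400 pre-trained Swin3D-T and swap in a
# 101-way head for UCF-101 (with the 0.1 dropout from the training config).
import torch.nn as nn
from torchvision.models.video import Swin3D_T_Weights, swin3d_t

model = swin3d_t(weights=Swin3D_T_Weights.KINETICS400_V1)

in_features = model.head.in_features   # 768 for the tiny model
model.head = nn.Sequential(
    nn.Dropout(p=0.1),
    nn.Linear(in_features, 101),
)
```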

Step 3: Training Configuration

  • Optimizer: AdamW with learning rate $3 \times 10^{-4}$ for the head and $3 \times 10^{-5}$ for the backbone (10x lower)
  • Scheduler: Cosine annealing with 5-epoch warmup
  • Epochs: 30
  • Batch size: 8 per GPU (with gradient accumulation of 4 for effective batch size 32)
  • Mixed precision: FP16 with gradient scaling
  • Label smoothing: 0.1
  • Dropout: 0.1 in the classification head
  • Gradient clipping: Max norm 1.0
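
These settings translate roughly into the PyTorch sketch below; the steps-per-epoch count and the weight decay value are illustrative assumptions, train_loader is assumed to yield (clips, labels) batches, and SequentialLR is just one reasonable way to implement the warmup.

```python
# Sketch of the fine-tuning loop: per-group LRs, warmup + cosine schedule,
# FP16 with gradient scaling, label smoothing, accumulation, and clipping.
import torch
from torch import nn

model = model.cuda()                                    # model from the previous sketch

head_params = [p for n, p in model.named_parameters() if n.startswith("head")]
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("head")]

optimizer = torch.optim.AdamW(
    [{"params": head_params, "lr": 3e-4},
     {"params": backbone_params, "lr": 3e-5}],          # 10x lower for the backbone
    weight_decay=0.05,                                  # not specified in the text; a common choice
)

EPOCHS, WARMUP, STEPS_PER_EPOCH, ACCUM = 30, 5, 1000, 4   # steps per epoch is illustrative
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=0.01, total_iters=WARMUP * STEPS_PER_EPOCH),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=(EPOCHS - WARMUP) * STEPS_PER_EPOCH),
    ],
    milestones=[WARMUP * STEPS_PER_EPOCH],
)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
scaler = torch.cuda.amp.GradScaler()

for epoch in range(EPOCHS):
    for step, (clips, labels) in enumerate(train_loader):   # train_loader defined elsewhere
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = criterion(model(clips.cuda()), labels.cuda()) / ACCUM
        scaler.scale(loss).backward()
        if (step + 1) % ACCUM == 0:                         # effective batch size 8 x 4 = 32
            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()
```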

Step 4: Multi-Clip Evaluation

During inference (see the sketch after this list):

  1. Sample 10 clips uniformly from the video
  2. For each clip, apply 3 spatial crops (left, center, right of a 256-pixel shorter side)
  3. Forward pass all 30 views through the model
  4. Average the softmax probabilities
  5. Predict the class with the highest average probability
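
In the sketch below, load_eval_clip is a hypothetical helper, analogous to the training loader above, that returns a normalized (C, T, H, W) clip whose shorter side has been resized to 256; the rest of the protocol is shown in full.

```python
# Sketch of 10-clip x 3-crop inference with softmax averaging.
import torch
import torch.nn.functional as F
from decord import VideoReader, cpu

def three_crops(clip, size=224):
    """Left, center, right crops of a (C, T, H, W) clip; assumes landscape (W >= H)."""
    h, w = clip.shape[-2:]
    hs = (h - size) // 2
    return [clip[..., hs:hs + size, o:o + size] for o in (0, (w - size) // 2, w - size)]

@torch.no_grad()
def predict_video(model, path, num_frames=16, stride=2, num_clips=10):
    model.eval()
    vr = VideoReader(path, ctx=cpu(0))
    starts = torch.linspace(0, max(len(vr) - num_frames * stride, 0), num_clips).long()
    probs = []
    for start in starts.tolist():
        clip = load_eval_clip(path, start, num_frames, stride)  # hypothetical helper
        for view in three_crops(clip):                          # 3 spatial crops
            logits = model(view.unsqueeze(0).cuda())
            probs.append(F.softmax(logits, dim=-1).cpu())
    return torch.cat(probs).mean(dim=0).argmax().item()         # average over 30 views
```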

Step 5: Baselines for Comparison

We compare against:

  1. Frame-level baseline: Average per-frame ResNet-50 features + linear classifier
  2. CLIP frame averaging: Average per-frame CLIP ViT-B/16 features + linear classifier (sketched after this list)
  3. CLIP + temporal transformer: Per-frame CLIP features processed by a 2-layer temporal transformer
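
For reference, the second baseline can be sketched with the openai CLIP package as follows; frame sampling and the linear-probe training loop are omitted, and video_feature is our helper name.

```python
# Hedged sketch of the CLIP frame-averaging baseline: average per-frame
# ViT-B/16 image features, then classify with a single linear layer.
import clip           # pip-installable from the openai/CLIP repository
import torch

device = "cuda"
clip_model, preprocess = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def video_feature(frames_pil):
    """Average L2-normalized CLIP image features over a list of PIL frames."""
    batch = torch.stack([preprocess(f) for f in frames_pil]).to(device)
    feats = clip_model.encode_image(batch)                  # (T, 512) for ViT-B/16
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0).float()                        # (512,) video-level feature

# 512-d averaged feature -> 101 UCF-101 classes (trained as a linear probe).
linear_head = torch.nn.Linear(512, 101).to(device)
```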

Results

Overall Accuracy (Split 1)

Method                              Top-1 Accuracy   Top-5 Accuracy   Inference Time (per video)
Frame ResNet-50 (avg)               81.2%            95.1%            0.3s
CLIP frame avg + linear             87.4%            97.2%            0.5s
CLIP + temporal transformer         89.1%            97.8%            0.8s
Video Swin-T (1 clip)               91.3%            98.4%            0.4s
Video Swin-T (10 clips x 3 crops)   93.8%            99.1%            12s

Per-Category Analysis

Categories where temporal modeling helps most (>10% improvement over frame-level):

  • Playing guitar vs. playing ukulele: Requires seeing the strumming motion pattern
  • Jumping jack vs. body weight squats: Same body pose in some frames
  • Writing on board vs. applying eye makeup: Arm motion patterns differ

Categories where temporal modeling helps least (<2% improvement):

  • Pizza tossing: Distinctive visual features in any frame
  • Rock climbing: Scene context is sufficient
  • Ice dancing: Unique costumes and setting

Learning Curve

Training loss converges by epoch 15, but validation accuracy continues improving until epoch 25, suggesting regularization is effective. The learning rate warmup is essential — without it, training diverges in the first epoch due to the large learning rate applied to the pre-trained backbone.

Key Lessons

  1. Pre-training dominates: The Video Swin Transformer pre-trained on Kinetics-400 provides a massive head start. Training from scratch on UCF-101 alone achieves only ~75% accuracy.

  2. Multi-clip evaluation provides 2-3% gains: The improvement is consistent across categories and costs nothing at training time, only extra inference compute (0.4s -> 12s per video in our setup). Always use it for final evaluation.

  3. CLIP features are surprisingly strong: Simply averaging CLIP ViT-B/16 frame features achieves 87.4% — competitive with older video-specific architectures — demonstrating the power of large-scale image pre-training.

  4. Temporal modeling matters for motion-dependent tasks: The biggest improvements from video transformers come on categories where temporal order is essential. For appearance-dominated categories, frame-level features suffice.

  5. Efficient data loading is the bottleneck: Without optimized video decoding (using decord or NVDEC), data loading dominates training time. Prefetching and multi-worker loading are essential.

  6. Lower backbone learning rate prevents catastrophic forgetting: Using 10x lower learning rate for the pre-trained backbone preserves spatial representations while allowing the head and temporal modeling to adapt.

Extensions

  • Evaluate on Something-Something v2 to test temporal reasoning specifically
  • Implement VideoMAE pre-training on UCF-101 and compare with Kinetics-400 pre-training
  • Add temporal action detection capabilities by extending the model to handle untrimmed videos
  • Implement knowledge distillation from Video Swin-B to Video Swin-T

Code Reference

The complete implementation is available in code/case-study-code.py.