Case Study 1: Video Classification with Video Transformers

Overview

Video classification is the most fundamental video understanding task: given a video clip, predict the action or activity being performed. In this case study, you will build a complete video classification pipeline using a pre-trained Video Swin Transformer, fine-tune it on the UCF-101 dataset, and implement the standard multi-clip evaluation protocol. You will compare the transformer-based approach with simpler baselines and analyze when temporal modeling makes the biggest difference.

Problem Statement

Build a video classification system for UCF-101 (101 human action classes) that:

  1. Loads and preprocesses video clips efficiently
  2. Fine-tunes a pre-trained Video Swin Transformer
  3. Evaluates using multi-clip, multi-crop inference
  4. Analyzes which action categories benefit most from temporal modeling

Dataset

UCF-101 contains 13,320 video clips across 101 action classes, including:

  • Sports (basketball, cricket, diving, etc.)
  • Playing instruments (guitar, violin, drums, etc.)
  • Human activities (brushing teeth, cooking, typing, etc.)
  • Human-object interactions (applying lipstick, using a phone, etc.)

Videos are variable length (2-12 seconds) at 320x240 resolution and 25 FPS. We use the standard three-split evaluation protocol.
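
The official splits are distributed as a small annotation folder (commonly named ucfTrainTestlist). A hedged parsing sketch is below; the file-format details reflect the standard UCF-101 annotation release but should be checked against your download, and load_split is our helper name, not part of the reference code.

```python
# Hedged sketch: parse UCF-101 split files into (video path, label) pairs.
# Assumes the standard ucfTrainTestlist layout (classInd.txt, trainlist0X.txt,
# testlist0X.txt); verify against your own download.
from pathlib import Path

def load_split(anno_dir, split=1):
    """Return (train, test) lists of (relative_video_path, 0-based class index)."""
    anno = Path(anno_dir)

    # classInd.txt: "<1-based index> <ClassName>" per line
    class_to_idx = {}
    for line in (anno / "classInd.txt").read_text().splitlines():
        idx, name = line.split()
        class_to_idx[name] = int(idx) - 1

    def parse(list_file):
        pairs = []
        for line in (anno / list_file).read_text().splitlines():
            if not line.strip():
                continue
            path = line.split()[0]                  # test lists carry no label column
            pairs.append((path, class_to_idx[path.split("/")[0]]))
        return pairs

    # trainlist0X.txt: "<ClassName>/<video>.avi <label>"; testlist0X.txt: path only
    return parse(f"trainlist{split:02d}.txt"), parse(f"testlist{split:02d}.txt")
```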

Approach

Step 1: Data Loading Pipeline

Building an efficient video data loader is critical (a minimal sketch follows this list):

  1. Video decoding: Use torchvision.io.read_video or the decord library (3-5x faster than OpenCV) to decode videos.
  2. Temporal sampling: During training, use dense sampling (random 16-frame clip). During evaluation, use uniform sampling (10 clips per video).
  3. Spatial preprocessing: Resize the shorter side to 256 pixels, then random crop to 224x224 during training and center crop during evaluation.
  4. Augmentation: Random horizontal flip, color jitter (consistent across frames), and RandAugment adapted for video.
  5. Normalization: ImageNet mean and standard deviation applied to each frame.
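
The sketch below illustrates the training-time loader described above, assuming the decord library is installed; the helper names and the temporal stride of 2 are our choices for illustration, not part of the reference implementation. Color jitter and RandAugment are omitted for brevity.

```python
# Sketch of dense temporal sampling + spatial preprocessing for training.
import random

import numpy as np
import torch
from decord import VideoReader, cpu
from torchvision import transforms

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def sample_train_clip(path, num_frames=16, stride=2):
    """Decode a random contiguous clip of num_frames frames (dense sampling)."""
    vr = VideoReader(path, ctx=cpu(0))
    span = num_frames * stride
    start = random.randint(0, max(len(vr) - span, 0))
    indices = np.clip(np.arange(start, start + span, stride), 0, len(vr) - 1)
    frames = vr.get_batch(indices).asnumpy()               # (T, H, W, C), uint8
    return torch.from_numpy(frames).permute(0, 3, 1, 2)    # (T, C, H, W)

# Shorter side to 256, random 224x224 crop, horizontal flip, ImageNet normalization.
# Because the whole (T, C, H, W) tensor passes through each transform at once,
# the crop and flip are applied consistently across all frames of the clip.
train_transform = transforms.Compose([
    transforms.ConvertImageDtype(torch.float32),
    transforms.Resize(256, antialias=True),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

def load_train_clip(path):
    clip = train_transform(sample_train_clip(path))        # (T, C, 224, 224)
    return clip.permute(1, 0, 2, 3)                        # (C, T, H, W) for the model
```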

Step 2: Model Architecture

We use Video Swin-Tiny (pre-trained on Kinetics-400):

  • Patch size: 2x4x4 (temporal x height x width)
  • Window size: 8x7x7
  • Embed dimension: 96
  • Depths: [2, 2, 6, 2] for the four stages
  • Approximately 28M parameters

We replace the classification head (400 classes for Kinetics-400) with a new linear layer (768 -> 101 for UCF-101).
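
One way to instantiate this in code is with torchvision's Swin3D-T, which mirrors the Video Swin-T configuration and ships Kinetics-400 weights; the original Video Swin checkpoints can instead be loaded through the authors' codebase. A sketch:

```python
# Hedged sketch: load a Kinetics-400 pre-trained Swin3D-T and swap in a
# 101-way head for UCF-101 (with the 0.1 dropout from the training config).
import torch.nn as nn
from torchvision.models.video import Swin3D_T_Weights, swin3d_t

model = swin3d_t(weights=Swin3D_T_Weights.KINETICS400_V1)

in_features = model.head.in_features   # 768 for the tiny model
model.head = nn.Sequential(
    nn.Dropout(p=0.1),
    nn.Linear(in_features, 101),
)
```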

Step 3: Training Configuration

  • Optimizer: AdamW with learning rate $3 \times 10^{-4}$ for the head and $3 \times 10^{-5}$ for the backbone (10x lower)
  • Scheduler: Cosine annealing with 5-epoch warmup
  • Epochs: 30
  • Batch size: 8 per GPU (with gradient accumulation of 4 for effective batch size 32)
  • Mixed precision: FP16 with gradient scaling
  • Label smoothing: 0.1
  • Dropout: 0.1 in the classification head
  • Gradient clipping: Max norm 1.0
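
These settings translate roughly into the PyTorch sketch below; the steps-per-epoch count and the weight decay value are illustrative assumptions, train_loader is assumed to yield (clips, labels) batches, and SequentialLR is just one reasonable way to implement the warmup.

```python
# Sketch of the fine-tuning loop: per-group LRs, warmup + cosine schedule,
# FP16 with gradient scaling, label smoothing, accumulation, and clipping.
import torch
from torch import nn

model = model.cuda()                                    # model from the previous sketch

head_params = [p for n, p in model.named_parameters() if n.startswith("head")]
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("head")]

optimizer = torch.optim.AdamW(
    [{"params": head_params, "lr": 3e-4},
     {"params": backbone_params, "lr": 3e-5}],          # 10x lower for the backbone
    weight_decay=0.05,                                  # not specified in the text; a common choice
)

EPOCHS, WARMUP, STEPS_PER_EPOCH, ACCUM = 30, 5, 1000, 4   # steps per epoch is illustrative
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=0.01, total_iters=WARMUP * STEPS_PER_EPOCH),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=(EPOCHS - WARMUP) * STEPS_PER_EPOCH),
    ],
    milestones=[WARMUP * STEPS_PER_EPOCH],
)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
scaler = torch.cuda.amp.GradScaler()

for epoch in range(EPOCHS):
    for step, (clips, labels) in enumerate(train_loader):   # train_loader defined elsewhere
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = criterion(model(clips.cuda()), labels.cuda()) / ACCUM
        scaler.scale(loss).backward()
        if (step + 1) % ACCUM == 0:                         # effective batch size 8 x 4 = 32
            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()
```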

Step 4: Multi-Clip Evaluation

During inference (see the sketch after this list):

  1. Sample 10 clips uniformly from the video
  2. For each clip, apply 3 spatial crops (left, center, right of a 256-pixel shorter side)
  3. Forward pass all 30 views through the model
  4. Average the softmax probabilities
  5. Predict the class with the highest average probability
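
In the sketch below, load_eval_clip is a hypothetical helper, analogous to the training loader above, that returns a normalized (C, T, H, W) clip whose shorter side has been resized to 256; the rest of the protocol is shown in full.

```python
# Sketch of 10-clip x 3-crop inference with softmax averaging.
import torch
import torch.nn.functional as F
from decord import VideoReader, cpu

def three_crops(clip, size=224):
    """Left, center, right crops of a (C, T, H, W) clip; assumes landscape (W >= H)."""
    h, w = clip.shape[-2:]
    hs = (h - size) // 2
    return [clip[..., hs:hs + size, o:o + size] for o in (0, (w - size) // 2, w - size)]

@torch.no_grad()
def predict_video(model, path, num_frames=16, stride=2, num_clips=10):
    model.eval()
    vr = VideoReader(path, ctx=cpu(0))
    starts = torch.linspace(0, max(len(vr) - num_frames * stride, 0), num_clips).long()
    probs = []
    for start in starts.tolist():
        clip = load_eval_clip(path, start, num_frames, stride)  # hypothetical helper
        for view in three_crops(clip):                          # 3 spatial crops
            logits = model(view.unsqueeze(0).cuda())
            probs.append(F.softmax(logits, dim=-1).cpu())
    return torch.cat(probs).mean(dim=0).argmax().item()         # average over 30 views
```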

Step 5: Baselines for Comparison

We compare against:

  1. Frame-level baseline: Average per-frame ResNet-50 features + linear classifier
  2. CLIP frame averaging: Average per-frame CLIP ViT-B/16 features + linear classifier (sketched after this list)
  3. CLIP + temporal transformer: Per-frame CLIP features processed by a 2-layer temporal transformer
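
For reference, the second baseline can be sketched with the openai CLIP package as follows; frame sampling and the linear-probe training loop are omitted, and video_feature is our helper name.

```python
# Hedged sketch of the CLIP frame-averaging baseline: average per-frame
# ViT-B/16 image features, then classify with a single linear layer.
import clip           # pip-installable from the openai/CLIP repository
import torch

device = "cuda"
clip_model, preprocess = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def video_feature(frames_pil):
    """Average L2-normalized CLIP image features over a list of PIL frames."""
    batch = torch.stack([preprocess(f) for f in frames_pil]).to(device)
    feats = clip_model.encode_image(batch)                  # (T, 512) for ViT-B/16
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0).float()                        # (512,) video-level feature

# 512-d averaged feature -> 101 UCF-101 classes (trained as a linear probe).
linear_head = torch.nn.Linear(512, 101).to(device)
```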

Results

Overall Accuracy (Split 1)

Method                              Top-1 Accuracy   Top-5 Accuracy   Inference Time (per video)
Frame ResNet-50 (avg)               81.2%            95.1%            0.3s
CLIP frame avg + linear             87.4%            97.2%            0.5s
CLIP + temporal transformer         89.1%            97.8%            0.8s
Video Swin-T (1 clip)               91.3%            98.4%            0.4s
Video Swin-T (10 clips x 3 crops)   93.8%            99.1%            12s

Per-Category Analysis

Categories where temporal modeling helps most (>10% improvement over frame-level):

  • Playing guitar vs. playing ukulele: Requires seeing the strumming motion pattern
  • Jumping jack vs. body weight squats: Same body pose in some frames
  • Writing on board vs. applying eye makeup: Arm motion patterns differ

Categories where temporal modeling helps least (<2% improvement):

  • Pizza tossing: Distinctive visual features in any frame
  • Rock climbing: Scene context is sufficient
  • Ice dancing: Unique costumes and setting

Learning Curve

Training loss converges by epoch 15, but validation accuracy continues improving until epoch 25, suggesting regularization is effective. The learning rate warmup is essential — without it, training diverges in the first epoch due to the large learning rate applied to the pre-trained backbone.

Key Lessons

  1. Pre-training dominates: The Video Swin Transformer pre-trained on Kinetics-400 provides a massive head start. Training from scratch on UCF-101 alone achieves only ~75% accuracy.

  2. Multi-clip evaluation provides 2-3% gains: The improvement is consistent across categories and costs nothing at training time, only extra inference compute (0.4s -> 12s per video in our setup). Always use it for final evaluation.

  3. CLIP features are surprisingly strong: Simply averaging CLIP ViT-B/16 frame features achieves 87.4% — competitive with older video-specific architectures — demonstrating the power of large-scale image pre-training.

  4. Temporal modeling matters for motion-dependent tasks: The biggest improvements from video transformers come on categories where temporal order is essential. For appearance-dominated categories, frame-level features suffice.

  5. Efficient data loading is the bottleneck: Without optimized video decoding (using decord or NVDEC), data loading dominates training time. Prefetching and multi-worker loading are essential.

  6. Lower backbone learning rate prevents catastrophic forgetting: Using 10x lower learning rate for the pre-trained backbone preserves spatial representations while allowing the head and temporal modeling to adapt.

Extensions

  • Evaluate on Something-Something v2 to test temporal reasoning specifically
  • Implement VideoMAE pre-training on UCF-101 and compare with Kinetics-400 pre-training
  • Add temporal action detection capabilities by extending the model to handle untrimmed videos
  • Implement knowledge distillation from Video Swin-B to Video Swin-T

Code Reference

The complete implementation is available in code/case-study-code.py.