Chapter 30: Key Takeaways
Video Fundamentals
- Video is a 5D tensor $(B, C, T, H, W)$ encoding both spatial appearance and temporal dynamics. A single second of 1080p video at 30 FPS contains approximately 186 million values, making computational efficiency a central concern.
- Temporal sampling is essential: uniform sampling provides broad coverage for long-term understanding, while dense sampling captures fine-grained motion for short-term dynamics. Most models process 8-32 frames per clip (see the sampling sketch after this list).
- Optical flow estimates per-pixel motion between consecutive frames. RAFT (Recurrent All-Pairs Field Transforms) is the current state-of-the-art deep learning approach, using a correlation volume and iterative GRU refinement.
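A minimal sketch of the two sampling strategies from the list above, in plain Python; `num_video_frames`, `clip_len`, and `stride` are illustrative parameters, not tied to any particular library.

```python
import random

def uniform_sample(num_video_frames: int, clip_len: int) -> list[int]:
    """Evenly spaced frame indices covering the whole video (long-term context)."""
    step = num_video_frames / clip_len
    return [min(int(step * i + step / 2), num_video_frames - 1) for i in range(clip_len)]

def dense_sample(num_video_frames: int, clip_len: int, stride: int = 2) -> list[int]:
    """Near-consecutive indices from a random window (short-term, fine-grained motion)."""
    span = (clip_len - 1) * stride + 1
    start = random.randint(0, max(0, num_video_frames - span))
    return [min(start + i * stride, num_video_frames - 1) for i in range(clip_len)]

# A 10-second clip at 30 FPS sampled down to 16 frames:
print(uniform_sample(300, 16))            # spread across the full 10 s
print(dense_sample(300, 16, stride=2))    # ~1 s window of consecutive motion
```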
3D Convolutions
- 3D convolutions extend 2D kernels to the temporal dimension with $k_T \times k_H \times k_W$ filters, directly learning spatiotemporal features. The standard kernel size is $3 \times 3 \times 3$.
- I3D inflation converts pre-trained 2D CNN weights to 3D by replicating each kernel along the temporal dimension and dividing by the temporal kernel size (so static inputs produce unchanged activations), enabling effective transfer of ImageNet spatial features to video models; see the inflation sketch after this list.
- Factorized convolutions reduce cost: R(2+1)D decomposes 3D into spatial $(1 \times 3 \times 3)$ + temporal $(3 \times 1 \times 1)$ convolutions. SlowFast uses two pathways — Slow (low frame rate, high capacity) for semantics and Fast (high frame rate, low capacity) for motion.
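A sketch of I3D-style inflation, assuming the 2D and 3D layers share spatial kernel size and channel counts; the layer sizes in the example are illustrative.

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, temporal_kernel: int = 3) -> nn.Conv3d:
    """Replicate a pre-trained 2D kernel along time and rescale by 1/k_T, so a
    static (frame-repeated) input yields the same activations as the 2D layer."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(temporal_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(temporal_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, kT, kH, kW), divided by kT
        w3d = conv2d.weight.unsqueeze(2).repeat(1, 1, temporal_kernel, 1, 1) / temporal_kernel
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: inflate a ResNet-style stem convolution
conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
conv3d = inflate_conv2d_to_3d(conv2d, temporal_kernel=5)
```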
Video Transformers
- TimeSformer applies divided space-time attention: temporal attention (same spatial position across frames) followed by spatial attention (all positions within a frame), reducing complexity from $O((TN)^2)$ to $O(TN(T+N))$; a sketch follows this list.
- ViViT explores four factorization strategies; the factorised encoder (separate spatial and temporal transformers) is most efficient. Tubelet embedding tokenizes video volumes directly, capturing short-range temporal information.
- Video Swin Transformer extends shifted-window attention to 3D spatiotemporal windows, giving efficient local attention with cross-window communication and achieving the best accuracy-efficiency tradeoff among video transformers.
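A sketch of one divided space-time attention block in the spirit of TimeSformer, assuming patch tokens are arranged as $(B, T, N, D)$ with the class token omitted; this is a simplified illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Temporal attention (same spatial position across frames),
    then spatial attention (all positions within a frame)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, D = x.shape
        # Temporal: attend over T for each of the N spatial positions -> O(T^2 N)
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm1(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial: attend over N within each of the T frames -> O(N^2 T)
        xs = x.reshape(B * T, N, D)
        h = self.norm2(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        return xs.reshape(B, T, N, D)

# 2 clips, 8 frames, 14x14 = 196 patch tokens, 768-dim embeddings
out = DividedSpaceTimeAttention(768)(torch.randn(2, 8, 196, 768))
```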
Video Classification
- Action recognition is the primary video understanding task, evaluated on Kinetics-400 (current models exceed 80% top-1 accuracy), UCF-101, and Something-Something v2.
- Multi-clip, multi-crop evaluation samples $K$ clips with $M$ crops each and averages predictions, improving accuracy by 2-3% over single-clip inference.
- VideoMAE applies masked autoencoder pre-training with a 90% masking ratio (exploiting video's high temporal redundancy), producing strong representations for downstream tasks.
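A sketch of VideoMAE-style tube masking, assuming tokens are laid out as $(T', H' \cdot W')$ tubelet positions; the dimensions below are illustrative.

```python
import torch

def tube_mask(batch: int, t_tokens: int, hw_tokens: int, ratio: float = 0.9) -> torch.Tensor:
    """Boolean mask of shape (B, T', H'*W'), True = masked.
    The same spatial positions are masked in every frame ("tubes"), so the model
    cannot reconstruct a patch by copying it from a visible neighbouring frame."""
    num_masked = int(hw_tokens * ratio)
    masks = []
    for _ in range(batch):
        perm = torch.randperm(hw_tokens)
        spatial = torch.zeros(hw_tokens, dtype=torch.bool)
        spatial[perm[:num_masked]] = True                         # mask ~90% of spatial positions
        masks.append(spatial.unsqueeze(0).expand(t_tokens, -1))   # replicate over time
    return torch.stack(masks)

# 16-frame clip with 2-frame tubelets (T' = 8) and 14x14 spatial tokens
mask = tube_mask(batch=4, t_tokens=8, hw_tokens=196)
print(mask.float().mean())  # ~0.9
```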
Temporal Action Detection
- Temporal action detection localizes when actions occur in untrimmed videos, outputting $(t_{\text{start}}, t_{\text{end}}, \text{class}, \text{confidence})$ segments. It is evaluated with mAP at several temporal IoU (tIoU) thresholds, typically 0.3, 0.5, and 0.7; a tIoU sketch follows this list.
- ActionFormer applies multi-scale temporal transformers directly to pre-extracted features, outperforming two-stage proposal-then-classify approaches.
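A sketch of the temporal IoU underlying the mAP evaluation above; the segment format follows the $(t_{\text{start}}, t_{\text{end}})$ convention of the list, and the numbers are illustrative.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two 1-D time segments (t_start, t_end), in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive at threshold 0.5 only if tIoU >= 0.5
print(temporal_iou((12.0, 18.5), (13.0, 19.0)))  # ~0.79
```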
Video Captioning and Language
- Video captioning extends image captioning to the temporal domain, requiring temporal aggregation of frame features before language generation.
- Video-language models (Video-LLaVA, Video-ChatGPT) adapt image-language architectures by replacing the image encoder with video feature extraction, enabling multi-turn conversations about video content.
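A sketch of the video-side adaptation described above: per-frame features from a frozen image encoder are temporally pooled and projected into the language model's embedding space. The encoder interface and hidden sizes here are placeholders, not the actual Video-LLaVA components.

```python
import torch
import torch.nn as nn

class VideoToLLMTokens(nn.Module):
    """Turn a (B, T, C, H, W) clip into a short sequence of LLM input embeddings."""
    def __init__(self, image_encoder: nn.Module, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.image_encoder = image_encoder       # frozen backbone: frame -> (N, vision_dim) patch features
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        B, T, C, H, W = video.shape
        frames = video.reshape(B * T, C, H, W)
        with torch.no_grad():                    # image encoder stays frozen
            feats = self.image_encoder(frames)   # (B*T, N, vision_dim)
        feats = feats.reshape(B, T, *feats.shape[1:])
        feats = feats.mean(dim=1)                # temporal aggregation: average over frames
        return self.projector(feats)             # (B, N, llm_dim) tokens prepended to the text prompt
```

Mean pooling is the simplest aggregation; systems may instead concatenate per-frame tokens or use a small temporal transformer before projection.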
Video Generation
- Diffusion-based video generation adds temporal attention layers to image diffusion U-Nets, jointly denoising all frames for temporal consistency.
- Sora represents the frontier: a diffusion transformer operating on spacetime patches that generates up to 60 seconds of high-quality, temporally coherent video.
- Evaluation uses FVD (Fréchet Video Distance), temporal consistency metrics, and human evaluation. Temporal coherence remains the primary challenge.
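FVD is the Fréchet distance between feature distributions of real and generated clips, with features taken from a pre-trained video backbone (classically an I3D network). A sketch of the distance itself, assuming the clip features have already been extracted:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FVD-style Fréchet distance between two sets of clip features, each of shape (N, D)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# real_feats, gen_feats: e.g. (num_clips, 400) arrays from a Kinetics-pretrained I3D
```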
Practical Guidelines
- For data loading, use hardware-accelerated decoders (decord, NVDEC) and sample during decoding to avoid unnecessary frame processing.
- For memory management, use gradient checkpointing, mixed precision (BF16), and gradient accumulation — video models are 8-32x more memory-intensive than image models.
- For transfer learning, initialize from a strong image model and add temporal layers (initialized to zero or identity) — this is more data-efficient than training from scratch.
- For deployment, consider frame-level features with temporal pooling as a fast baseline before committing to expensive end-to-end video models.
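A sketch of that fast baseline: per-frame features from a pre-trained 2D backbone, mean-pooled over time, with a linear classifier on top. It assumes torchvision is available (weights download on first use); the frame count and class count are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class FramePoolingBaseline(nn.Module):
    """2D backbone applied per frame + temporal average pooling + linear head."""
    def __init__(self, num_classes: int = 400):
        super().__init__()
        self.backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.backbone.fc = nn.Identity()             # keep the 2048-d pooled features
        self.head = nn.Linear(2048, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        B, T, C, H, W = video.shape                  # e.g. (B, 8, 3, 224, 224)
        feats = self.backbone(video.reshape(B * T, C, H, W))
        feats = feats.reshape(B, T, -1).mean(dim=1)  # temporal average pooling
        return self.head(feats)

logits = FramePoolingBaseline()(torch.randn(2, 8, 3, 224, 224))
```

If this baseline is close to the target accuracy, the extra cost of an end-to-end spatiotemporal model may not be justified for the deployment budget.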