Chapter 30: Video Understanding and Generation

"A video is worth a million words — and a thousand images arranged in time."

Video is the richest and most challenging visual medium. Unlike static images, videos encode temporal dynamics — the way objects move, interact, and change over time. A person raising their hand is fundamentally different from a person lowering their hand, yet any single frame from either action might look identical. Understanding video requires not only spatial perception (what is in each frame) but also temporal reasoning (how things change across frames).

This chapter covers the full spectrum of video AI, from foundational representations to state-of-the-art architectures. You will learn how 3D convolutions extend spatial feature extraction to the temporal dimension, how video transformers like TimeSformer and ViViT apply self-attention to spatiotemporal tokens, and how these architectures power applications from action recognition to video captioning. We also explore the emerging frontier of video generation. Throughout, you will implement working systems using PyTorch and Hugging Face.

The concepts in this chapter build directly on the foundations established in earlier chapters. The spatial attention mechanisms extend those from Chapter 26 (Vision Transformers) to the temporal domain. The video diffusion models draw on the framework developed in Chapter 27 (Diffusion Models). And the video-language models combine the multimodal alignment techniques of Chapter 28 with temporal reasoning. Understanding video AI requires synthesizing all of these elements, making this chapter a natural culmination of Part V.


30.1 Video as Temporal Sequences

30.1.1 Digital Video Fundamentals

A digital video is a sequence of frames (images) displayed at a fixed rate. Key parameters include:

  • Frame rate (FPS): Typically 24 (cinema), 25 (PAL), 30 (NTSC), or 60 (high-frame-rate) frames per second.
  • Resolution: Common resolutions include 720p (1280x720), 1080p (1920x1080), and 4K (3840x2160).
  • Duration: Videos range from seconds (clips) to hours (movies).

A single second of 1080p video at 30 FPS contains $30 \times 1920 \times 1080 \times 3 \approx 186$ million values. This enormous data volume makes video processing computationally demanding and necessitates careful design of sampling and compression strategies.

30.1.2 The Temporal Dimension: What Makes Video Different

The fundamental difference between image and video understanding is the temporal dimension. Consider two frames showing a person's hand near a door handle. From a single frame, we cannot tell if the person is opening or closing the door, reaching for the handle, or pulling their hand away. Only by observing the temporal sequence can we determine the action. This is why temporal modeling — understanding how things change over time — is the central technical challenge of video AI.

Different tasks require different temporal modeling granularities:

  • Action recognition (e.g., "running" vs. "walking") often requires understanding motion patterns over 1-3 seconds.
  • Activity recognition (e.g., "cooking a meal") requires understanding sequences of actions over minutes.
  • Story understanding (e.g., summarizing a movie scene) requires tracking characters, objects, and narrative over much longer durations.

30.1.3 Temporal Sampling Strategies

Processing every frame is typically unnecessary and computationally prohibitive. Common sampling strategies include:

  • Uniform sampling: Select $T$ frames uniformly spaced throughout the video. For a video with $N$ total frames and $T$ desired frames, sample frames at indices $\{0, \lfloor N/T \rfloor, 2\lfloor N/T \rfloor, \ldots\}$.
  • Dense sampling: Extract short clips of $T$ consecutive frames from random starting points. Better for capturing fine-grained temporal dynamics.
  • Multi-scale temporal sampling: Sample at multiple temporal resolutions simultaneously to capture both fast and slow motions.

The choice of $T$ (typically 8, 16, or 32 frames) represents a tradeoff between temporal coverage and computational cost.

Worked Example — Sampling strategies compared: For a 10-second video at 30 FPS (300 total frames, indices 0-299) with $T = 8$:

  • Uniform: Sample frames $\{0, 42, 85, 128, 171, 214, 257, 299\}$ — covers the full 10 seconds but misses fast actions between samples (each gap is about 1.4 seconds).
  • Dense: Sample frames $\{120, 121, 122, 123, 124, 125, 126, 127\}$ — captures fine-grained motion at the midpoint but misses the start and end entirely.
  • Multi-scale: Sample 4 frames uniformly $\{0, 100, 200, 299\}$ plus 4 dense frames around the midpoint $\{148, 149, 150, 151\}$ — balances global coverage with local temporal detail.
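
These index computations take only a few lines of NumPy. Below is a minimal sketch; the helper names uniform_indices and dense_indices are illustrative rather than from any library, and exact indices can differ by a frame or two depending on the rounding convention:

import numpy as np


def uniform_indices(num_total: int, num_sample: int) -> np.ndarray:
    """Sample frame indices evenly spaced across the whole video."""
    return np.linspace(0, num_total - 1, num_sample, dtype=int)


def dense_indices(num_total: int, clip_len: int, start: int | None = None) -> np.ndarray:
    """Sample a clip of consecutive frames from a (random) start position."""
    if start is None:
        start = np.random.randint(0, max(1, num_total - clip_len + 1))
    return np.arange(start, start + clip_len).clip(0, num_total - 1)


print(uniform_indices(300, 8))           # e.g. [  0  42  85 128 170 213 256 299]
print(dense_indices(300, 8, start=120))  # [120 121 122 123 124 125 126 127]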

For tasks like Something-Something v2 that require fine-grained temporal discrimination (e.g., "pushing something from left to right" vs. "pushing something from right to left"), dense sampling significantly outperforms uniform sampling. For tasks like Kinetics that involve whole-body actions recognizable from almost any frame, uniform sampling works well.

30.1.4 Video as a Tensor

For deep learning, a video clip is represented as a 5D tensor:

$$\mathbf{V} \in \mathbb{R}^{B \times C \times T \times H \times W}$$

where $B$ is the batch size, $C$ is the number of channels (3 for RGB), $T$ is the number of frames, $H$ is the height, and $W$ is the width. This convention (channels first, time before spatial dimensions) is standard in PyTorch.

30.1.5 Optical Flow

Optical flow estimates the apparent motion of each pixel between consecutive frames. Given frames $I_t$ and $I_{t+1}$, the optical flow field $(\mathbf{u}, \mathbf{v})$ assigns a displacement vector to each pixel, indicating where it has moved.

Classical methods: The Lucas-Kanade and Horn-Schunck algorithms compute optical flow based on brightness constancy and spatial smoothness assumptions.

Deep learning methods: FlowNet (Dosovitskiy et al., 2015) and RAFT (Teed & Deng, 2020) use neural networks to estimate flow, achieving superior accuracy. RAFT iteratively refines flow estimates using a correlation volume and a GRU-based update operator.

Optical flow was historically used as a two-stream input to video models (one stream for appearance, one for motion), but modern architectures increasingly learn to extract temporal information directly from RGB frames, making explicit flow computation unnecessary for many tasks.

30.1.6 Motion Estimation: Intuition and Applications

The intuition behind optical flow is straightforward: if a ball moves 5 pixels to the right between frame $t$ and frame $t+1$, the flow vector at the ball's location in frame $t$ is $(u, v) = (5, 0)$. Extending this to every pixel in the image produces a dense flow field that captures the apparent motion of the entire scene.

However, optical flow has important limitations:

  • Aperture problem: A moving edge seen through a small aperture appears to move only in the direction perpendicular to the edge. The true motion along the edge is ambiguous.
  • Brightness constancy violations: Changes in lighting, shadows, or reflections violate the brightness constancy assumption, producing erroneous flow estimates.
  • Occlusion: When an object moves to reveal a previously hidden region, no correct flow exists for the newly visible pixels.

Despite these limitations, optical flow remains valuable for video stabilization, frame interpolation, and as an auxiliary input for action recognition. We will see RAFT's deep learning approach to flow estimation in Section 30.8.

30.1.7 Video Compression and its Impact on AI

Real-world videos are almost always stored in compressed formats (H.264, H.265/HEVC, VP9, AV1). These codecs exploit temporal redundancy by storing only the differences between frames:

  • I-frames (intra-coded): Independently encoded, like JPEG images.
  • P-frames (predicted): Encoded as differences from a previous frame.
  • B-frames (bidirectional): Encoded using both past and future frames.

This compression means that decoding a video is computationally expensive — especially seeking to random positions, which may require decoding from the nearest I-frame. For AI training, this has practical implications:

  • Random access is slow: Sampling random clips requires seeking to I-frames, then decoding forward to the desired frame.
  • Efficient data loading is critical: Libraries like decord, torchvision.io, and NVIDIA's DALI provide hardware-accelerated decoding that avoids CPU bottlenecks.
  • Compressed-domain features: Some research has explored extracting features directly from the compressed video stream (motion vectors and residuals) without full decoding, achieving significant speedups.


30.2 3D Convolutions for Video

30.2.1 From 2D to 3D Convolutions

Standard 2D convolutions operate over spatial dimensions $(H, W)$ with kernels of size $k_H \times k_W$. 3D convolutions extend this to the temporal dimension, using kernels of size $k_T \times k_H \times k_W$ that slide over both space and time simultaneously.

For an input feature map $\mathbf{X} \in \mathbb{R}^{C_{\text{in}} \times T \times H \times W}$, a 3D convolutional layer with $C_{\text{out}}$ filters of size $k_T \times k_H \times k_W$ produces:

$$\mathbf{Y}[c_{\text{out}}, t, h, w] = \sum_{c_{\text{in}}} \sum_{\tau} \sum_{i} \sum_{j} \mathbf{W}[c_{\text{out}}, c_{\text{in}}, \tau, i, j] \cdot \mathbf{X}[c_{\text{in}}, t+\tau, h+i, w+j] + b[c_{\text{out}}]$$

A typical 3D convolutional kernel with $k_T = 3$, $k_H = 3$, $k_W = 3$ captures local spatiotemporal patterns within a 3-frame, 3x3-pixel neighborhood.
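
As a quick sanity check of these shapes, the snippet below (a minimal sketch) applies a single nn.Conv3d with a $3 \times 3 \times 3$ kernel and padding 1, which preserves the temporal and spatial extent of the input:

import torch
import torch.nn as nn

# One 3D convolution: 3 input channels -> 64 filters, 3x3x3 kernel, padding 1.
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))

clip = torch.randn(2, 3, 16, 112, 112)  # [B, C, T, H, W]
out = conv3d(clip)
print(out.shape)  # torch.Size([2, 64, 16, 112, 112]): T, H, W preserved by padding
# Each filter has 3 * 3 * 3 * 3 = 81 weights (C_in * k_T * k_H * k_W), plus one bias.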

30.2.2 C3D: The First 3D CNN

C3D (Tran et al., 2015) was one of the first successful 3D CNN architectures for video. It used a VGG-style architecture with all 2D convolutions replaced by 3D convolutions with $3 \times 3 \times 3$ kernels. Key findings:

  • $3 \times 3 \times 3$ kernels consistently outperformed other kernel shapes.
  • Features from the fc6 layer proved to be effective generic video descriptors.
  • C3D could process 16-frame clips at approximately 300 FPS on a K40 GPU.

30.2.3 I3D: Inflating 2D Networks

I3D (Carreira & Zisserman, 2017) introduced the idea of "inflating" pre-trained 2D CNN weights into 3D. Given a pre-trained 2D kernel $\mathbf{W}_{2D} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times k \times k}$, the inflated 3D kernel is:

$$\mathbf{W}_{3D}[:, :, \tau, :, :] = \frac{1}{k_T}\mathbf{W}_{2D} \quad \text{for } \tau = 0, 1, \ldots, k_T - 1$$

The division by $k_T$ ensures that the output magnitude matches the 2D network when the input is a static video (identical frames). This initialization bootstraps temporal learning from pre-trained spatial representations, dramatically improving training efficiency.
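
The inflation rule is easy to implement directly. The sketch below (the function name is illustrative) replicates a 2D kernel along the temporal axis, scales it by $1/k_T$, and verifies that the inflated kernel reproduces the 2D response on a "static video" of repeated frames:

import torch
import torch.nn.functional as F


def inflate_2d_kernel(w_2d: torch.Tensor, k_t: int) -> torch.Tensor:
    """Inflate a 2D conv kernel [C_out, C_in, k, k] to 3D [C_out, C_in, k_t, k, k]."""
    return w_2d.unsqueeze(2).repeat(1, 1, k_t, 1, 1) / k_t


w_2d = torch.randn(64, 3, 7, 7)       # e.g. the first convolution of a 2D backbone
w_3d = inflate_2d_kernel(w_2d, k_t=5)
print(w_3d.shape)                     # torch.Size([64, 3, 5, 7, 7])

# On a static video (the same frame repeated k_t times), the inflated kernel
# gives the same response as the original 2D kernel on a single frame.
frame = torch.randn(1, 3, 32, 32)
video = frame.unsqueeze(2).repeat(1, 1, 5, 1, 1)            # [1, 3, 5, 32, 32]
out_2d = F.conv2d(frame, w_2d)
out_3d = F.conv3d(video, w_3d)                              # temporal output size 1
print(torch.allclose(out_2d, out_3d[:, :, 0], atol=1e-5))   # True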

I3D inflated an Inception-v1 architecture and achieved state-of-the-art results on Kinetics-400, demonstrating that ImageNet pre-training could effectively transfer to video understanding.

30.2.4 Factorized 3D Convolutions

Full 3D convolutions are computationally expensive. Factorization separates spatial and temporal processing:

R(2+1)D (Tran et al., 2018): Decomposes a $k_T \times k_H \times k_W$ convolution into a spatial $1 \times k_H \times k_W$ convolution followed by a temporal $k_T \times 1 \times 1$ convolution. This doubles the number of nonlinearities (activation functions between the two operations) while keeping the parameter count comparable to the full 3D convolution, and often improves accuracy.
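
The factorization can be written as two nn.Conv3d layers with a nonlinearity in between. A minimal sketch is shown below; the intermediate channel count of 144 is chosen so the parameter count matches a full $3 \times 3 \times 3$ convolution with 64 input and output channels (110,592 weights):

import torch
import torch.nn as nn


class R2Plus1DBlock(nn.Module):
    """Factorized (2+1)D convolution: spatial 1x3x3 followed by temporal 3x1x1."""

    def __init__(self, in_channels: int, out_channels: int, mid_channels: int) -> None:
        super().__init__()
        self.spatial = nn.Conv3d(
            in_channels, mid_channels,
            kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False,
        )
        self.bn1 = nn.BatchNorm3d(mid_channels)
        self.temporal = nn.Conv3d(
            mid_channels, out_channels,
            kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False,
        )
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.spatial(x)))   # extra nonlinearity vs. a full 3D conv
        return self.relu(self.bn2(self.temporal(x)))


block = R2Plus1DBlock(64, 64, mid_channels=144)
x = torch.randn(1, 64, 16, 56, 56)
print(block(x).shape)   # torch.Size([1, 64, 16, 56, 56])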

SlowFast Networks (Feichtenhofer et al., 2019): Uses two pathways:

  • Slow pathway: Processes frames at a low frame rate (e.g., 4 FPS) with a large channel capacity, capturing spatial semantics.
  • Fast pathway: Processes frames at a high frame rate (e.g., 32 FPS) with a small channel capacity (1/8 of Slow), capturing rapid temporal changes.

Lateral connections fuse information from Fast to Slow. This design is inspired by the biological distinction between the parvocellular (slow, high-capacity) and magnocellular (fast, low-capacity) pathways in the primate visual system.

30.2.5 X3D: Efficient Video Networks

X3D (Feichtenhofer, 2020) systematically expanded a tiny base architecture along multiple axes — temporal duration, frame rate, spatial resolution, width, depth, and bottleneck width — to find optimal efficiency tradeoffs. X3D-M achieves comparable accuracy to SlowFast R-101 while being 4.8x more efficient in FLOPs.

30.2.6 Practical Implementation of 3D CNNs

Here is a practical implementation of an I3D-style video classifier in PyTorch:

import torch
import torch.nn as nn


class Simple3DConvBlock(nn.Module):
    """A basic 3D convolutional block with batch norm and ReLU.

    Args:
        in_channels: Number of input channels.
        out_channels: Number of output channels.
        kernel_size: Size of the 3D convolutional kernel.
        stride: Stride of the convolution.
        padding: Padding applied to input.
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: tuple[int, int, int] = (3, 3, 3),
        stride: tuple[int, int, int] = (1, 1, 1),
        padding: tuple[int, int, int] = (1, 1, 1),
    ) -> None:
        super().__init__()
        self.conv = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            bias=False,
        )
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.bn(self.conv(x)))


class SimpleVideoClassifier(nn.Module):
    """Simple 3D CNN video classifier for illustration.

    Args:
        num_classes: Number of action classes.
        in_channels: Number of input channels (3 for RGB).
    """

    def __init__(
        self,
        num_classes: int = 400,
        in_channels: int = 3,
    ) -> None:
        super().__init__()
        self.features = nn.Sequential(
            Simple3DConvBlock(in_channels, 64, stride=(1, 2, 2)),
            Simple3DConvBlock(64, 128, stride=(2, 2, 2)),
            Simple3DConvBlock(128, 256, stride=(2, 2, 2)),
            Simple3DConvBlock(256, 512, stride=(2, 2, 2)),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Classify a video clip.

        Args:
            x: Video tensor [batch, channels, time, height, width].

        Returns:
            Class logits [batch, num_classes].
        """
        features = self.features(x)
        pooled = self.pool(features).flatten(1)
        return self.classifier(pooled)


# Example
model = SimpleVideoClassifier(num_classes=101)
video = torch.randn(2, 3, 16, 112, 112)  # 2 clips, 16 frames, 112x112
logits = model(video)
print(f"Output shape: {logits.shape}")  # [2, 101]

30.3 Video Transformers

30.3.1 The Challenge of Spatiotemporal Attention

Applying a standard ViT to video faces a severe computational challenge. A 16-frame video at 224x224 with 16x16 patches produces $16 \times 196 = 3{,}136$ tokens. Full self-attention over 3,136 tokens has complexity $O(3136^2) \approx O(10^7)$ per layer, which is roughly 256x more expensive than a single-image ViT.

To put this concretely: the attention matrix for a single head with 3,136 tokens requires $3136^2 \times 4$ bytes $\approx$ 39 MB in float32. With 12 heads, that is 470 MB for the attention maps alone in a single layer. For 12 layers, this exceeds 5 GB — just for the attention maps, not counting gradients, activations, or model weights. This quickly becomes untenable for training.

This has motivated a variety of factorized attention strategies that decompose the full spatiotemporal attention into more manageable components. The fundamental insight is that not all token pairs need to interact directly. Spatial tokens within the same frame need to interact (for understanding the scene), and temporal tokens at the same position need to interact (for understanding motion), but a patch in the top-left of frame 1 rarely needs direct interaction with a patch in the bottom-right of frame 16.

30.3.2 TimeSformer

TimeSformer (Bertasius et al., 2021) from Facebook AI explored five different spatiotemporal attention schemes:

  1. Space-only attention (S): Each frame is processed independently. No temporal modeling.
  2. Joint space-time attention (ST): Full attention over all spatiotemporal tokens. Maximum expressiveness but quadratic in $T \times N$.
  3. Divided space-time attention (T+S): Alternates between temporal attention (each spatial token attends to the same spatial position across all frames) and spatial attention (each frame's tokens attend to each other independently). This is the recommended variant.
  4. Sparse local-global attention: Combines local spatial attention with global temporal attention.
  5. Axial attention: Attention along each axis (time, height, width) separately.

Divided space-time attention achieves the best accuracy-efficiency tradeoff:

For each transformer layer:

  1. Temporal attention: Token at position $(t, i)$ attends to tokens $\{(1, i), (2, i), \ldots, (T, i)\}$ — the same spatial position across all frames.
  2. Spatial attention: Token at position $(t, i)$ attends to tokens $\{(t, 1), (t, 2), \ldots, (t, N)\}$ — all spatial positions within the same frame.

This reduces complexity from $O((TN)^2)$ to $O(TN(T + N))$.
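
Most of the work in divided attention is tensor reshaping: temporal attention treats the T tokens at each spatial position as one sequence, and spatial attention treats the N tokens in each frame as one sequence. The sketch below illustrates this bookkeeping with nn.MultiheadAttention; it omits the CLS token, layer norms, and MLP sublayers that a full TimeSformer block includes:

import torch
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    """Temporal attention followed by spatial attention over [B, T, N, D] tokens."""

    def __init__(self, dim: int = 768, num_heads: int = 12) -> None:
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape

        # Temporal attention: B*N sequences of length T (same patch across frames).
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]
        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)

        # Spatial attention: B*T sequences of length N (all patches in one frame).
        xs = x.reshape(b * t, n, d)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        return xs.reshape(b, t, n, d)


tokens = torch.randn(2, 8, 196, 768)   # [B, T, N, D]: 8 frames of 14x14 patches
print(DividedSpaceTimeAttention()(tokens).shape)   # torch.Size([2, 8, 196, 768])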

30.3.3 ViViT: Video Vision Transformer

ViViT (Arnab et al., 2021) from Google explored four design variants:

Model 1 — Spatiotemporal attention: Flatten all spatiotemporal tokens and apply full ViT. Simple but expensive.

Model 2 — Factorised encoder: Two separate transformer encoders. The spatial encoder processes each frame independently, producing per-frame CLS tokens. The temporal encoder then processes the sequence of CLS tokens across time. This is highly efficient but limits cross-frame spatial interactions.

Model 3 — Factorised self-attention: Similar to TimeSformer's divided attention, alternating spatial and temporal attention within each layer.

Model 4 — Factorised dot-product attention: Computes spatial and temporal attention in parallel and fuses the results.

ViViT also introduced tubelet embedding as an alternative to frame-level patch embedding. A tubelet of size $t_T \times t_H \times t_W$ extracts a 3D patch from the video, directly capturing short-range temporal information in the tokenization step.

30.3.4 Tubelet Embedding: Tokenizing Video

Before examining more architectures, it is worth pausing on the tokenization step, which is the video equivalent of ViT's patch embedding (as we saw in Section 26.2.1). There are two main approaches:

Frame-level patch embedding: Each frame is independently divided into patches, producing $T \times N$ tokens (where $N$ is the number of patches per frame). This is simple but does not capture short-range temporal information in the tokenization.

Tubelet embedding: A 3D patch (or "tubelet") of size $t_T \times t_H \times t_W$ extracts a spatiotemporal volume. For example, with $t_T = 2$, $t_H = 16$, $t_W = 16$, each token spans 2 consecutive frames and a 16x16 spatial region. This directly captures short-range temporal information at the tokenization stage.

The tubelet approach is equivalent to a 3D convolution with kernel size $(t_T, t_H, t_W)$ and stride $(t_T, t_H, t_W)$:

import torch
import torch.nn as nn


class TubeletEmbedding(nn.Module):
    """Convert a video clip into a sequence of tubelet embeddings.

    Args:
        tube_t: Temporal size of each tubelet.
        tube_h: Height of each tubelet.
        tube_w: Width of each tubelet.
        in_channels: Number of input channels (3 for RGB).
        embed_dim: Embedding dimension.
    """

    def __init__(
        self,
        tube_t: int = 2,
        tube_h: int = 16,
        tube_w: int = 16,
        in_channels: int = 3,
        embed_dim: int = 768,
    ) -> None:
        super().__init__()
        self.proj = nn.Conv3d(
            in_channels, embed_dim,
            kernel_size=(tube_t, tube_h, tube_w),
            stride=(tube_t, tube_h, tube_w),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Project video tubelets into embedding space.

        Args:
            x: Video tensor [batch, channels, time, height, width].

        Returns:
            Token embeddings [batch, num_tokens, embed_dim].
        """
        x = self.proj(x)              # [B, E, T', H', W']
        x = x.flatten(2).transpose(1, 2)  # [B, T'*H'*W', E]
        return x


# Example: 16 frames, 224x224, with 2x16x16 tubelets
video = torch.randn(2, 3, 16, 224, 224)
embed = TubeletEmbedding(tube_t=2, tube_h=16, tube_w=16)
tokens = embed(video)
# num_tokens = (16/2) * (224/16) * (224/16) = 8 * 14 * 14 = 1568
print(f"Token shape: {tokens.shape}")  # [2, 1568, 768]

Worked Example: For a 16-frame video at 224x224 resolution:

  • Frame-level patches (16x16): $16 \times 14 \times 14 = 3{,}136$ tokens
  • Tubelets (2x16x16): $8 \times 14 \times 14 = 1{,}568$ tokens (2x reduction)
  • Tubelets (4x16x16): $4 \times 14 \times 14 = 784$ tokens (4x reduction)

The temporal stride in tubelet embedding provides a natural way to control the computational cost of the subsequent transformer layers, which scale quadratically with token count.

30.3.5 Video Swin Transformer

Video Swin Transformer (Liu et al., 2022) extends the Swin Transformer to video by computing 3D shifted window attention:

  • Windows are 3D: $T_w \times M \times M$ (e.g., $2 \times 7 \times 7$)
  • Shifted windows alternate with regular windows, just as in 2D Swin
  • Relative position bias is extended to 3D

This achieves linear complexity in the number of spatiotemporal tokens while maintaining the ability to model cross-window interactions through shifting.

30.3.6 Comparison of Video Architectures

| Architecture | Temporal Modeling | Complexity | Pre-training | Top-1 on Kinetics-400 |
| --- | --- | --- | --- | --- |
| I3D | 3D convolution | O(THW) | ImageNet inflation | 72.1% |
| SlowFast R-101 | Dual pathway | O(THW) | From scratch | 79.8% |
| TimeSformer-L | Divided attention | O(TN(T+N)) | ImageNet-21K | 80.7% |
| ViViT-L (Model 2) | Factorised encoder | O(TN^2 + T^2) | ImageNet-21K | 81.3% |
| Video Swin-B | 3D shifted windows | O(TNM^2) | ImageNet-21K | 82.7% |

30.4 Video Classification

30.4.1 Action Recognition

Action recognition — classifying a video clip into an action category — is the most studied video understanding task. Given a video clip $\mathbf{V} \in \mathbb{R}^{T \times H \times W \times 3}$, the model predicts a class label $y$ from a set of action classes.

Standard benchmarks include:

  • Kinetics-400/600/700: 400/600/700 human action classes, 10-second clips from YouTube. The de facto standard benchmark.
  • UCF-101: 101 action classes, 13K clips. Smaller and more commonly used for quick experiments.
  • Something-Something v2: 174 fine-grained hand gesture classes. Requires temporal reasoning (distinguishing "pushing something from left to right" vs. "pushing something from right to left").
  • Moments in Time: 339 classes covering diverse activities, events, and actions.

30.4.2 Multi-Clip Evaluation

During inference, multiple clips are typically sampled from the video and their predictions averaged:

  1. Sample $K$ clips uniformly from the video (e.g., $K = 10$).
  2. For each clip, apply $M$ spatial crops (typically 3: left, center, right).
  3. Average all $K \times M$ predictions to get the final class distribution.

This multi-clip, multi-crop evaluation reduces variance and typically improves accuracy by 1-3% over single-clip evaluation, at the cost of $K \times M$ times more computation.
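
In code, multi-clip evaluation amounts to averaging softmax outputs over all sampled clips and crops. A minimal sketch, assuming a model that maps a batch of clips to class logits (the SimpleVideoClassifier from Section 30.2.6 would work):

import torch


@torch.no_grad()
def multi_clip_predict(model: torch.nn.Module, clips: torch.Tensor) -> torch.Tensor:
    """Average predictions over K clips x M crops of a single video.

    Args:
        clips: Tensor [K * M, C, T, H, W] of sampled clips and crops.

    Returns:
        Averaged class probabilities [num_classes].
    """
    logits = model(clips)                  # [K * M, num_classes]
    probs = torch.softmax(logits, dim=-1)
    return probs.mean(dim=0)               # average over clips and crops


# Example:
# model = SimpleVideoClassifier(num_classes=101).eval()
# clips = torch.randn(10 * 3, 3, 16, 112, 112)   # K=10 clips, M=3 crops
# print(multi_clip_predict(model, clips).shape)  # torch.Size([101])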

30.4.3 Two-Stream Approaches (Historical)

The two-stream architecture (Simonyan & Zisserman, 2014) processes appearance and motion separately:

  • Spatial stream: A 2D CNN processing a single RGB frame for appearance features.
  • Temporal stream: A 2D CNN processing a stack of optical flow frames for motion features.

Predictions from both streams are combined (late fusion). This approach was dominant before 3D CNNs and transformers demonstrated that temporal information could be learned directly from RGB frames.

30.4.4 Video Foundation Models

Recent video foundation models achieve state-of-the-art performance through large-scale pre-training:

  • VideoMAE (Tong et al., 2022): Masked autoencoder pre-training for video transformers. Uses 90% masking ratio (higher than MAE's 75% for images) because temporal redundancy makes video even easier to reconstruct from partial information. The key insight is that adjacent frames share so much information that masking 90% of spatiotemporal tokens still leaves enough visible context for meaningful reconstruction.
  • InternVideo (Wang et al., 2022): Combines masked video modeling with video-text contrastive learning for versatile video representations. This dual-objective approach produces features that are strong for both classification (from the masked modeling) and retrieval (from the contrastive learning).
  • UMT (Unified Multimodal Transformer): Pre-trains a single model on image-text, video-text, and video-only data, demonstrating that unified pre-training across modalities and data types improves video understanding.

30.4.5 Practical Video Classification with HuggingFace

Here is a practical implementation of video classification using a pre-trained VideoMAE model:

import torch
import numpy as np
from transformers import (
    VideoMAEForVideoClassification,
    VideoMAEImageProcessor,
)


def classify_video(
    video_frames: np.ndarray,
    model_name: str = "MCG-NJU/videomae-base-finetuned-kinetics",
    top_k: int = 5,
) -> list[dict[str, float]]:
    """Classify a video clip using VideoMAE.

    Args:
        video_frames: Array of shape [num_frames, height, width, 3],
                      with pixel values in [0, 255].
        model_name: HuggingFace model identifier.
        top_k: Number of top predictions to return.

    Returns:
        List of dicts with 'label' and 'score' keys.
    """
    processor = VideoMAEImageProcessor.from_pretrained(model_name)
    model = VideoMAEForVideoClassification.from_pretrained(model_name)

    # VideoMAE expects 16 frames; sample uniformly if needed
    num_frames = video_frames.shape[0]
    target_frames = model.config.num_frames
    if num_frames != target_frames:
        indices = np.linspace(0, num_frames - 1, target_frames, dtype=int)
        video_frames = video_frames[indices]

    # Convert to list of frames for the processor
    frames_list = [frame for frame in video_frames]
    inputs = processor(frames_list, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_indices = probs.topk(top_k)

    results = []
    for prob, idx in zip(top_probs[0], top_indices[0]):
        results.append({
            "label": model.config.id2label[idx.item()],
            "score": prob.item(),
        })
    return results


# Usage with decord for efficient video loading:
# from decord import VideoReader
# vr = VideoReader("action_clip.mp4")
# frames = vr.get_batch(range(0, len(vr), len(vr)//16)).asnumpy()
# predictions = classify_video(frames)
# for pred in predictions:
#     print(f"{pred['label']}: {pred['score']:.3f}")

30.4.6 Computational Budget Analysis

Video AI is exceptionally compute-hungry. Let us analyze the costs:

Storage: A 10-second 1080p video at 30 FPS occupies $300 \times 1920 \times 1080 \times 3 = 1.86$ billion values as raw tensors (about 7.4 GB in float32). Compressed (H.264), the same video is typically 5-20 MB.

Computation: Processing 16 frames through a ViT-Base with 196 patches per frame (3,136 total tokens) requires about 16x the FLOPs of a single image for the token-wise computations (projections and MLPs). With full spatiotemporal attention, the attention cost grows by approximately $16^2 = 256$x because of its quadratic scaling in token count.

Memory: Training a Video Swin-B model on 32-frame 224x224 clips with batch size 8 requires approximately 40 GB of GPU memory. Gradient checkpointing reduces this to about 20 GB at the cost of ~30% more training time.

These costs motivate the various efficiency strategies discussed throughout this chapter: factorized attention, tubelet embedding, temporal sampling, and mixed precision training.


30.5 Temporal Action Detection

30.5.1 Problem Definition

Unlike classification (which assigns a label to an entire clip), temporal action detection identifies when actions occur within untrimmed videos. Given a long video, the model must output a set of temporal segments $\{(t_{\text{start}}^i, t_{\text{end}}^i, c^i, s^i)\}$, where each segment has a start time, end time, action class, and confidence score.

30.5.2 Two-Stage Approaches

Traditional methods follow a two-stage pipeline:

  1. Proposal generation: Generate candidate temporal segments using methods such as:
     • Temporal sliding windows: Multi-scale windows at regular intervals.
     • Actionness scoring: A network scores each temporal position for "action likelihood," and proposals are generated around high-scoring regions.
     • BSN/BMN (boundary-sensitive and boundary-matching networks): Predict start and end probabilities for each temporal position, then combine them to form proposals.

  2. Proposal classification: Each proposal is classified using a video classifier and refined with temporal regression.

30.5.3 One-Stage Approaches

Inspired by one-stage object detectors, ActionFormer (Zhang et al., 2022) applies a transformer directly to temporal features:

  1. Extract per-frame features using a pre-trained video backbone.
  2. Apply a multi-scale temporal transformer to model long-range dependencies.
  3. At each temporal position and scale, predict action class and temporal boundaries.

This approach is simpler, faster, and often more accurate than two-stage methods.

30.5.4 Evaluation Metrics

Temporal action detection is evaluated using mean Average Precision (mAP) at different temporal Intersection-over-Union (tIoU) thresholds. A prediction is correct if its tIoU with a ground truth segment exceeds the threshold. Standard thresholds are 0.3, 0.5, and 0.7, with mAP@0.5 being the most commonly reported.

The temporal IoU between a predicted segment $(t_s^p, t_e^p)$ and a ground truth segment $(t_s^g, t_e^g)$ is:

$$\text{tIoU} = \frac{\max(0, \min(t_e^p, t_e^g) - \max(t_s^p, t_s^g))}{\max(t_e^p, t_e^g) - \min(t_s^p, t_s^g)}$$

Worked Example: If the ground truth is "drinking coffee" from 5.0s to 8.0s, and the model predicts 4.5s to 7.5s: the intersection is $\max(0, 7.5 - 5.0) = 2.5$s, the union is $8.0 - 4.5 = 3.5$s, so tIoU = $2.5 / 3.5 = 0.714$. This exceeds the 0.5 threshold (correct at mAP@0.5) and the 0.7 threshold (correct at mAP@0.7).
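
The tIoU computation is a few lines of Python; the sketch below reproduces the worked example:

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between a predicted and a ground-truth (start, end) segment."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0


print(temporal_iou((4.5, 7.5), (5.0, 8.0)))   # 0.714...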


30.6 Video Captioning

30.6.1 Architecture

Video captioning generates a natural language description of a video. The architecture mirrors image captioning but must handle temporal information:

  1. Video encoder: Extract features from sampled frames using a video backbone (e.g., TimeSformer, ViViT) or per-frame features from a 2D backbone (CLIP ViT) with temporal aggregation.
  2. Temporal modeling: Aggregate frame-level features into a video-level representation using temporal attention, pooling, or a temporal transformer.
  3. Language decoder: Generate text autoregressively, conditioned on the video features through cross-attention.

30.6.2 Dense Video Captioning

Dense video captioning (Krishna et al., 2017) combines temporal action detection with captioning: the model must both localize events and describe each one. The ActivityNet Captions dataset contains ~100K descriptions for ~20K videos.

30.6.3 Video-Language Models

Modern video-language models adapt image-language architectures (discussed in Chapter 28) to video:

  • Video-LLaVA: Extends the LLaVA architecture (Section 28.6) by replacing the image encoder with a video encoder and training on video instruction data. The key design choice is how to represent the video: Video-LLaVA extracts per-frame features using a ViT encoder and concatenates all frame tokens as input to the LLM, producing a very long token sequence.
  • Video-ChatGPT: Takes a more memory-efficient approach by adding temporal and spatial pooling to create a compact set of video tokens from frame-level CLIP features. Specifically, it applies average pooling across the temporal dimension (collapsing T frames into 1 representation per spatial position) and across the spatial dimension (collapsing all spatial positions into 1 representation per frame), then concatenates both to form the video representation. This reduces the token count from $T \times N$ to $T + N$, enabling multi-turn conversation about videos within a standard context window.
  • LLaVA-NeXT-Video: Uses the AnyRes technique from LLaVA-NeXT to handle both images and multi-frame video within a unified architecture, treating video frames similarly to image tiles.

The core challenge for all video-language models is the token budget: a 32-frame video with 576 tokens per frame produces 18,432 visual tokens — far exceeding most LLMs' context windows. All practical systems must compress the video representation, either through temporal pooling, spatial pooling, Q-Former-style bottlenecks, or token selection strategies.
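
The pooling scheme described above for Video-ChatGPT is simple to sketch: averaging frame-level features over time yields N spatial tokens, averaging over space yields T temporal tokens, and concatenating the two gives T + N video tokens. A minimal illustration (not the exact published implementation):

import torch


def spatiotemporal_pool(frame_features: torch.Tensor) -> torch.Tensor:
    """Reduce [T, N, D] frame-level features to [T + N, D] video tokens.

    Args:
        frame_features: Per-frame patch features, e.g. from a CLIP ViT.

    Returns:
        Concatenation of T spatially pooled and N temporally pooled tokens.
    """
    spatial_tokens = frame_features.mean(dim=0)    # [N, D]: average over time
    temporal_tokens = frame_features.mean(dim=1)   # [T, D]: average over space
    return torch.cat([temporal_tokens, spatial_tokens], dim=0)   # [T + N, D]


features = torch.randn(32, 576, 1024)   # 32 frames x 576 patches (e.g. CLIP ViT-L/14 at 336px)
print(spatiotemporal_pool(features).shape)   # torch.Size([608, 1024]) = 32 + 576 tokens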


30.7 Video Generation

30.7.1 The Challenge of Temporal Consistency

Generating video is dramatically harder than generating images because consecutive frames must be temporally consistent — objects should move smoothly, lighting should change gradually, and the scene should maintain physical coherence. Generating frames independently produces flickering, teleporting objects, and physically impossible dynamics.

To appreciate the difficulty, consider that a 4-second video at 24 FPS and 512x512 resolution contains $24 \times 4 \times 512 \times 512 \times 3 \approx 75$ million pixel values, all of which must be coherent both spatially (each frame is a valid image) and temporally (consecutive frames form a plausible sequence). The space of valid videos is an extraordinarily thin manifold within this 75-million-dimensional space, making generation far more constrained than image generation.

The challenge is further compounded by human sensitivity to temporal artifacts. We easily detect:

  • Flickering: Rapid changes in brightness or color between frames
  • Morphing: Objects gradually changing shape in unnatural ways
  • Teleportation: Objects appearing to jump between positions
  • Physics violations: Objects defying gravity, passing through each other, or changing mass

Even small inconsistencies that would be imperceptible in a single image become glaringly obvious when viewed as a video sequence. This places much higher demands on temporal coherence than on per-frame quality.

30.7.2 GAN-Based Video Generation (Historical)

Early approaches extended GANs to video:

  • VGAN and TGAN: Separate content and motion generation using a latent code that decomposes into a time-invariant content vector and a time-varying motion vector.
  • MoCoGAN: Decomposes video into content (static) and motion (dynamic) latent codes, with the motion code modeled as a random walk in latent space, producing smooth temporal evolution.
  • DVD-GAN: Dual discriminator (spatial + temporal) for higher-resolution video. The spatial discriminator evaluates individual frames, while the temporal discriminator evaluates the coherence of frame sequences.

These approaches were limited to very short (16-64 frame), low-resolution (64x64 to 256x256) videos with simple dynamics. The fundamental limitation was the same as for image GANs — mode collapse, training instability, and limited diversity — but amplified by the temporal dimension. A mode-collapsed video GAN might generate the same motion pattern for every sample, or produce temporally incoherent sequences that look reasonable frame-by-frame but fail as videos.

The shift to diffusion-based video generation, discussed below, addressed these stability issues while dramatically improving quality and temporal coherence.

30.7.3 Diffusion-Based Video Generation

Diffusion models have revolutionized video generation, as they did for images (see Chapter 27). The key challenge is extending the image diffusion framework to the temporal dimension while maintaining consistency.

Video Diffusion Models (Ho et al., 2022): Extends image diffusion by adding temporal attention layers to the U-Net. The 3D U-Net architecture alternates between:

  • Spatial attention/convolution (processing each frame independently)
  • Temporal attention/convolution (processing across frames at each spatial position)

The forward process adds noise independently to each frame, and the reverse process jointly denoises all frames, with temporal attention ensuring consistency. The key insight is that the temporal attention layers learn to coordinate the denoising across frames — if one frame is denoised to show a ball on the left, adjacent frames should show the ball at nearby positions.

Stable Video Diffusion (SVD): Builds on Stable Diffusion's architecture, adding temporal convolution and attention layers. SVD takes a conditioning image and generates a short video clip (14-25 frames) showing plausible motion. The training procedure involves three stages: (1) pre-train the image model, (2) fine-tune on video data with temporal layers, (3) fine-tune on high-quality curated video data for improved aesthetics.

Sora (OpenAI, 2024): Represents a step change in video generation quality and duration. Rather than using a U-Net, Sora uses a diffusion transformer (DiT) architecture operating on spacetime patches — building on the DiT work discussed in Section 27.14. Key architectural concepts include:

  1. Spacetime patch embeddings: The input video (or latent representation of a video) is divided into 3D patches that span both space and time, similar to the tubelet embedding we discussed in Section 30.3.4. These patches are linearly embedded into tokens.

  2. Transformer backbone: A standard transformer (not a U-Net) processes these spatiotemporal tokens with full self-attention. The transformer architecture scales more predictably than U-Nets, following the scaling laws observed in language models.

  3. Variable resolution and duration: Sora is trained on videos of diverse durations (up to 60 seconds), resolutions, and aspect ratios by using NaViT-style packing (as mentioned in Section 26.10.4). This enables generation at any desired aspect ratio without distortion.

  4. Recaptioning: Training videos are re-captioned with detailed descriptions generated by a multimodal model (similar to the BLIP-2 approach from Section 28.4.2), significantly improving text-video alignment. Short, vague captions are replaced with detailed descriptions of content, motion, camera angles, and aesthetics.

While the full details of Sora's architecture have not been published, the general principles — transformer-based diffusion on spatiotemporal patches with recaptioned training data — have been validated by subsequent open-source projects like Open-Sora.

30.7.4 The Temporal Consistency Challenge

Temporal consistency remains the central challenge in video generation. Several techniques address it:

  • Temporal attention: The most common approach, where tokens at the same spatial position across frames attend to each other. This encourages smooth motion but can still produce flickering for complex scenes.
  • Motion modules: Separate networks that model the dynamics of the scene, predicting how each spatial region should move over time. AnimateDiff adds plug-and-play motion modules to existing text-to-image models.
  • Flow-based warping: Use predicted optical flow to warp the previous frame, then denoise only the residual. This ensures pixel-level consistency but struggles with large motions and occlusions.
  • Temporal super-resolution: Generate keyframes first, then interpolate between them to fill in the gaps. This ensures long-range consistency in the keyframes while using a lightweight model for interpolation.

30.7.5 Autoregressive Video Generation

An alternative approach generates video frames autoregressively:

  • Generate frame 1 conditioned on text
  • Generate frame 2 conditioned on text + frame 1
  • Generate frame $t$ conditioned on text + frames $1, \ldots, t-1$

This naturally ensures temporal consistency but is slow and can accumulate errors over long sequences. VideoPoet from Google uses a tokenizer to convert video frames into discrete tokens and then applies an autoregressive transformer.

30.7.6 Evaluation of Video Generation

Video generation quality is assessed using:

  • FVD (Frechet Video Distance): Extension of FID to video, comparing distributions of generated and real video features.
  • IS (Inception Score): Applied per-frame, then averaged.
  • CLIPSIM: CLIP similarity between generated video frames and the text prompt.
  • Temporal consistency metrics: Optical flow smoothness, warping error between consecutive frames.
  • Human evaluation: Side-by-side preference studies remain the gold standard.

30.7.7 Image-to-Video Generation

An increasingly popular paradigm is image-to-video (I2V) generation, where a single static image serves as the first frame and the model generates subsequent frames showing plausible motion:

  1. Encode the conditioning image through the VAE encoder to obtain its latent representation.
  2. Concatenate this latent with noise for the remaining frames (or use it as a conditioning signal through cross-attention).
  3. Denoise to produce a coherent video sequence.

Stable Video Diffusion (SVD) is the leading open-source I2V model. It takes a single image and generates 14 or 25 frames of motion at 576x1024 resolution. Applications include:

  • Product visualization: Animate a product photo to show it in use
  • Creative tools: Turn illustrations or photos into short animations
  • Previsualization: Generate motion concepts from storyboard frames

The quality of I2V generation has improved dramatically — models can now produce realistic camera motion (panning, zooming), object motion (walking, flowing water), and even complex interactions, though physics violations still occur frequently.


30.8 Optical Flow: Deep Dive

30.8.1 The Brightness Constancy Equation

The fundamental assumption of optical flow is brightness constancy: a pixel's intensity does not change between frames:

$$I(x, y, t) = I(x + u, y + v, t + 1)$$

where $(u, v)$ is the flow vector. A first-order Taylor expansion yields the optical flow constraint equation:

$$I_x u + I_y v + I_t = 0$$

where $I_x$, $I_y$, and $I_t$ are the spatial and temporal image gradients. This single equation has two unknowns, requiring additional constraints (smoothness, local constancy) for a unique solution.
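
The classical Lucas-Kanade solution assumes the flow is constant within a small window, which turns the single constraint per pixel into an overdetermined linear system solved by least squares. A minimal per-window sketch (using NumPy gradients in place of the derivative filters a production implementation would use):

import numpy as np


def lucas_kanade_window(patch_t: np.ndarray, patch_t1: np.ndarray) -> np.ndarray:
    """Estimate one (u, v) flow vector for a small grayscale window.

    Solves I_x u + I_y v = -I_t in the least-squares sense over all pixels
    in the window, assuming the flow is constant within it.
    """
    i_y, i_x = np.gradient(patch_t)      # spatial gradients (rows, then columns)
    i_t = patch_t1 - patch_t             # temporal gradient
    a = np.stack([i_x.ravel(), i_y.ravel()], axis=1)   # [num_pixels, 2]
    b = -i_t.ravel()
    flow, *_ = np.linalg.lstsq(a, b, rcond=None)
    return flow


# Example: a Gaussian blob shifted one pixel to the right between frames.
xx, yy = np.meshgrid(np.arange(15), np.arange(15))
frame_t = np.exp(-((xx - 7.0) ** 2 + (yy - 7.0) ** 2) / 8.0)
frame_t1 = np.exp(-((xx - 8.0) ** 2 + (yy - 7.0) ** 2) / 8.0)
print(lucas_kanade_window(frame_t, frame_t1))   # approximately [1.0, 0.0]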

30.8.2 RAFT: Recurrent All-Pairs Field Transforms

RAFT (Teed & Deng, 2020) is the dominant deep learning approach to optical flow:

  1. Feature extraction: A shared CNN extracts features from both frames.
  2. Correlation volume: Compute all-pairs dot products between features from the two frames, creating a 4D correlation volume.
  3. Iterative refinement: A GRU-based update operator iteratively refines the flow estimate by indexing into the correlation volume at the current flow estimate's location.

RAFT achieves state-of-the-art accuracy on Sintel and KITTI benchmarks while being efficient enough for practical use.

Worked Example — RAFT dimensions: For two frames of size 480x640:

  1. Feature extraction produces features of size 60x80x256 (1/8 resolution).
  2. The 4D correlation volume has size $60 \times 80 \times 60 \times 80 \approx 23$ million entries. RAFT pools the last two dimensions to build a multi-scale correlation pyramid and, at each iteration, looks up only a small neighborhood around the current flow estimate; the paper also describes a memory-efficient variant that avoids materializing the full volume.
  3. The GRU update operator runs for 12-32 iterations, each refining the flow estimate by predicting a small update $\Delta\mathbf{f}$.
  4. The flow field, estimated at 1/8 resolution, is upsampled to the full 480x640x2 output (horizontal and vertical displacement for each pixel).

30.8.3 Practical Optical Flow Computation

import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights
import torchvision.transforms.functional as F


def compute_optical_flow(
    frame1: torch.Tensor,
    frame2: torch.Tensor,
) -> torch.Tensor:
    """Compute optical flow between two frames using RAFT.

    Args:
        frame1: First frame tensor [3, H, W] with values in [0, 1].
        frame2: Second frame tensor [3, H, W] with values in [0, 1].

    Returns:
        Flow field tensor [2, H, W] (horizontal, vertical displacement).
    """
    weights = Raft_Large_Weights.DEFAULT
    transforms = weights.transforms()

    model = raft_large(weights=weights, progress=False)
    model = model.eval()

    # Preprocess
    img1_batch, img2_batch = transforms(
        frame1.unsqueeze(0), frame2.unsqueeze(0)
    )

    with torch.no_grad():
        # RAFT returns a list of flow predictions (one per iteration)
        flow_predictions = model(img1_batch, img2_batch)

    # Take the final (most refined) prediction
    flow = flow_predictions[-1].squeeze(0)  # [2, H, W]
    return flow


# The flow field can be visualized as a color image:
# - Hue encodes direction of motion
# - Saturation encodes magnitude of motion

30.8.4 Applications of Optical Flow

  • Video stabilization: Compensate for camera shake using global flow estimation.
  • Frame interpolation: Generate intermediate frames by warping using flow (see the warping sketch after this list).
  • Action recognition: Flow provides explicit motion cues that complement appearance.
  • Video editing: Propagate edits across frames using flow-based warping.
  • Object tracking: Estimate object motion between frames.
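
As an illustration of the flow-based warping used by several of these applications, the sketch below backward-warps a frame using a dense flow field such as the one returned by compute_optical_flow above. It is a minimal version that ignores occlusion handling:

import torch
import torch.nn.functional as F


def warp_with_flow(frame2: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp frame2 toward frame1 using flow from frame1 to frame2.

    Args:
        frame2: Image tensor [1, 3, H, W].
        flow: Flow field [2, H, W] with per-pixel (dx, dy) displacements.

    Returns:
        Warped image [1, 3, H, W], approximately aligned with frame1.
    """
    _, _, h, w = frame2.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    # Displace the sampling grid by the flow, then normalize to [-1, 1].
    x_new = (xs + flow[0]) / (w - 1) * 2 - 1
    y_new = (ys + flow[1]) / (h - 1) * 2 - 1
    grid = torch.stack([x_new, y_new], dim=-1).unsqueeze(0)   # [1, H, W, 2]
    return F.grid_sample(frame2, grid, align_corners=True)


# A crude intermediate frame can be approximated by scaling the flow before
# warping (real interpolation methods also handle occlusions and use
# bidirectional flow): midpoint = warp_with_flow(frame2, 0.5 * flow)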

30.9 Video Understanding Pipeline: Practical Implementation

30.9.1 Data Loading and Preprocessing

Efficient video data loading is critical for training performance. Key considerations:

  • Decoding: Use hardware-accelerated decoders (NVDEC via PyTorch's torchvision.io or decord) to avoid CPU bottlenecks.
  • Temporal sampling: Sample frames during loading, not after, to avoid decoding unnecessary frames.
  • Spatial augmentation: Apply consistent augmentations across all frames in a clip (same random crop, same color jitter parameters).
  • Temporal augmentation: Random temporal offset for clip start position; temporal jittering.

30.9.2 Memory-Efficient Training

Video models require substantial GPU memory. Practical strategies:

  • Gradient checkpointing: Trade compute for memory by recomputing activations during the backward pass.
  • Mixed precision (FP16/BF16): Reduce memory by 50% with minimal accuracy impact.
  • Gradient accumulation: Simulate large batch sizes with limited memory.
  • Frame dropping: Randomly drop frames during training as augmentation (also saves memory).

30.9.3 Transfer Learning for Video

The most effective approach is to initialize from a strong image model and adapt to video:

  1. Start with a pre-trained ViT (e.g., from CLIP or DINOv2).
  2. Add temporal attention layers (initialized to zero or identity).
  3. Fine-tune on the target video dataset.

This leverages the rich spatial representations from image pre-training while learning temporal dynamics from video data. The zero-initialization of temporal layers is the same principle used in Flamingo's gated cross-attention (Section 28.7.3) and ControlNet's zero convolutions (Section 27.10.2): at initialization, the model behaves identically to the pre-trained image model, and the temporal capability is learned gradually.
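
A minimal sketch of this adaptation pattern is shown below: a residual temporal attention block whose output projection is zero-initialized, so that before any fine-tuning the block is an exact identity and the pre-trained spatial features pass through unchanged. The class is illustrative, not any specific paper's implementation:

import torch
import torch.nn as nn


class ZeroInitTemporalAttention(nn.Module):
    """Residual temporal attention whose contribution is zero at initialization."""

    def __init__(self, dim: int = 768, num_heads: int = 12) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)   # zero-init: the block starts as identity
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: spatiotemporal tokens [B, T, N, D]."""
        b, t, n, d = x.shape
        seq = x.permute(0, 2, 1, 3).reshape(b * n, t, d)   # one sequence per spatial position
        normed = self.norm(seq)
        attn_out = self.attn(normed, normed, normed, need_weights=False)[0]
        seq = seq + self.proj(attn_out)                    # residual; contributes zero at init
        return seq.reshape(b, n, t, d).permute(0, 2, 1, 3)


block = ZeroInitTemporalAttention()
x = torch.randn(2, 8, 196, 768)
print(torch.allclose(block(x), x))   # True: identity mapping before training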

30.9.4 End-to-End Video Pipeline Example

Here is a complete example showing video loading, preprocessing, and classification:

import torch
import numpy as np
from decord import VideoReader, cpu


def load_video_clip(
    video_path: str,
    num_frames: int = 16,
    target_size: tuple[int, int] = (224, 224),
) -> torch.Tensor:
    """Load and preprocess a video clip for model input.

    Args:
        video_path: Path to video file.
        num_frames: Number of frames to sample.
        target_size: (height, width) to resize frames to.

    Returns:
        Video tensor of shape [channels, num_frames, height, width].
    """
    vr = VideoReader(video_path, ctx=cpu(0))
    total_frames = len(vr)

    # Uniform temporal sampling
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = vr.get_batch(indices).asnumpy()  # [T, H, W, C]

    # Resize frames
    import torchvision.transforms.functional as F
    from PIL import Image

    processed = []
    for frame in frames:
        img = Image.fromarray(frame)
        img = F.resize(img, target_size)
        tensor = F.to_tensor(img)  # [C, H, W], values in [0, 1]
        # Normalize with ImageNet stats
        tensor = F.normalize(
            tensor, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
        )
        processed.append(tensor)

    # Stack: [T, C, H, W] -> [C, T, H, W]
    video_tensor = torch.stack(processed, dim=0)  # [T, C, H, W]
    video_tensor = video_tensor.permute(1, 0, 2, 3)  # [C, T, H, W]
    return video_tensor


def temporal_augmentation(
    video: torch.Tensor,
    speed_range: tuple[float, float] = (0.8, 1.2),
) -> torch.Tensor:
    """Apply temporal augmentation by randomly changing playback speed.

    Args:
        video: Video tensor [C, T, H, W].
        speed_range: Range of speed multipliers.

    Returns:
        Temporally augmented video tensor [C, T, H, W].
    """
    c, t, h, w = video.shape
    speed = np.random.uniform(*speed_range)
    new_t = int(t / speed)

    # Resample frame indices at the new playback speed
    new_indices = np.linspace(0, t - 1, new_t)

    # Simple nearest-neighbor temporal resampling
    resampled_indices = np.round(new_indices).astype(int)
    resampled_indices = np.clip(resampled_indices, 0, t - 1)

    resampled = video[:, resampled_indices, :, :]

    # Pad or crop back to original length
    if new_t < t:
        padding = torch.zeros(c, t - new_t, h, w)
        resampled = torch.cat([resampled, padding], dim=1)
    elif new_t > t:
        resampled = resampled[:, :t, :, :]

    return resampled

30.10 Advanced Topics

30.10.1 Long-Form Video Understanding

Most video models process short clips (a few seconds), but real-world video understanding often requires reasoning over much longer durations. A 90-minute movie at 24 FPS contains 129,600 frames. Even with aggressive temporal sampling (1 frame per second), this produces 5,400 frames — far beyond what any current video transformer can process in a single pass. Understanding long-form videos (minutes to hours) requires:

  • Hierarchical approaches: Process short clips, then aggregate clip-level features.
  • Memory mechanisms: Maintain a memory bank that stores relevant information from earlier parts of the video.
  • Sparse attention: Attend to a subset of frames rather than all frames.

Benchmarks like EgoSchema (5-minute egocentric videos with multiple-choice questions) and MovieChat (long movie understanding with conversational QA) test these capabilities. EgoSchema is particularly challenging because answers often depend on understanding the full 5-minute context — a model that only sees a few seconds will miss the information needed to answer correctly.

The long-form video understanding problem is analogous to the long-context challenge in language models (discussed in Chapter 14). Just as language models struggle to reason over documents exceeding their context window, video models struggle with temporal spans beyond their processing capacity. Solutions in both domains share common themes: hierarchical processing, memory mechanisms, and retrieval-augmented approaches that selectively access relevant content.

30.10.2 Video-Text Retrieval

Video-text retrieval finds relevant videos given a text query (or vice versa). Approaches mirror image-text retrieval but must handle temporal information:

  1. Encode video clips using a video encoder.
  2. Encode text queries using a text encoder.
  3. Compute similarity in a shared embedding space.

CLIP4Clip and related models adapt CLIP for video-text retrieval by adding temporal pooling or attention over frame-level CLIP features. The simplest approach (mean pooling) averages CLIP features across frames; more sophisticated methods use a temporal transformer to produce a video-level embedding. As we saw with CLIP's image-text alignment in Chapter 28, the shared embedding space enables zero-shot retrieval — searching for videos using natural language without any task-specific training.
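
The mean-pooling variant is easy to express: encode each frame, average over time, normalize, and compare against normalized text embeddings. A minimal sketch with placeholder random features standing in for CLIP outputs:

import torch
import torch.nn.functional as F


def video_text_similarity(
    frame_features: torch.Tensor,
    text_features: torch.Tensor,
) -> torch.Tensor:
    """Cosine similarity between mean-pooled frame embeddings and text embeddings.

    Args:
        frame_features: Per-frame embeddings [num_videos, num_frames, dim].
        text_features: Text query embeddings [num_queries, dim].

    Returns:
        Similarity matrix [num_queries, num_videos].
    """
    video_emb = F.normalize(frame_features.mean(dim=1), dim=-1)   # temporal mean pooling
    text_emb = F.normalize(text_features, dim=-1)
    return text_emb @ video_emb.T


frames = torch.randn(100, 12, 512)    # 100 videos, 12 sampled frames each
queries = torch.randn(5, 512)         # 5 text queries
sims = video_text_similarity(frames, queries)
print(sims.argsort(dim=-1, descending=True)[:, :5])   # top-5 video indices per query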

VideoCLIP (Xu et al., 2021) takes a different approach: instead of adapting a pre-trained image CLIP model, it trains a video-text contrastive model from scratch on video-text pairs from HowTo100M (100 million narrated instructional video clips). The key innovation is temporally overlapping positive pairs — since video narrations are loosely aligned with visual content, VideoCLIP samples overlapping (but not identical) time windows for positive pairs, creating a softer contrastive signal that handles the inherent temporal misalignment in narrated video.

30.10.3 Video Object Segmentation and Tracking

Given an object annotation in the first frame, video object segmentation (VOS) tracks and segments the object throughout the video. This task requires maintaining object identity across frames even as the object undergoes deformation, partial occlusion, scale changes, and appearance variations. Modern methods like XMem and Cutie use memory-based approaches where object features from previous frames are stored in a memory bank and queried via cross-attention to segment the current frame. This memory-based paradigm is related to the memory mechanisms used in long-form video understanding (Section 30.10.1) and to the key-value memory structure of transformer attention itself. SAM 2, the video extension of SAM discussed in Chapter 26, also uses this memory-based approach, enabling interactive video segmentation where users can provide corrections at any frame and see them propagate through the video.

30.10.4 Egocentric Video Understanding

First-person (egocentric) video from wearable cameras presents unique challenges: frequent hand-object interactions, rapid head movements, and the absence of the observer from the scene. The Ego4D benchmark (Grauman et al., 2022) provides 3,670 hours of daily-life activity video with rich annotations for tasks including episodic memory, future prediction, and social interaction understanding.

Key tasks unique to egocentric video include: - Episodic memory: "Where did I last see my keys?" — requires temporal search over the wearer's visual history. - Future prediction: "What will I do next?" — requires understanding intent and activity patterns. - Hand-object interaction: Tracking which objects the wearer's hands interact with and how. - Social interaction: Understanding the wearer's interactions with other people, including gaze direction and conversational dynamics.

30.10.5 Video Question Answering

Video QA extends the visual question answering framework of Chapter 28 to the temporal domain. Given a video and a natural language question, the model must provide an answer that may require temporal reasoning:

  • "What did the person do after picking up the cup?" (temporal ordering)
  • "How many times did the ball bounce?" (counting over time)
  • "What was the person wearing when they entered the room?" (temporal grounding + attribute recognition)

Modern video QA systems typically use a video encoder to extract per-frame or clip-level features, then feed these as visual tokens to a large language model. Video-LLaVA and Video-ChatGPT, mentioned in Section 30.6.3, represent the current approach: extract frame-level CLIP features, apply temporal pooling or attention, and concatenate the resulting video tokens with the text query as input to the LLM.

30.10.6 Challenges and Future Directions

Several fundamental challenges remain in video AI:

  1. Long-form understanding: Most models process clips of 4-32 frames (1-2 seconds). Understanding a 2-hour movie requires processing thousands of frames while maintaining a coherent model of characters, plot, and visual themes. Current approaches (hierarchical processing, sparse sampling, memory banks) are partial solutions.

  2. Physical world modeling: Generating physically realistic video requires implicit understanding of physics — gravity, collisions, fluid dynamics, material properties. While Sora-class models produce visually impressive results, they still generate physically impossible sequences (objects merging, fluid defying gravity, limbs appearing and disappearing).

  3. Real-time processing: Most video AI models operate offline, processing pre-recorded clips. Real-time applications (autonomous driving, robotic manipulation, live captioning) require streaming architectures with strict latency constraints.

  4. Evaluation: Evaluating video generation quality remains an open problem. FVD captures distributional similarity but does not directly measure temporal consistency. Human evaluation is expensive and difficult to standardize. Developing automatic metrics that correlate well with human judgments of video quality, and of temporal coherence in particular, is an active research area.


30.11 Exercises

  1. Temporal sampling analysis: Load a 30-second video of a complex activity (e.g., cooking). Extract clips using (a) uniform sampling of 16 frames, (b) dense sampling of 16 consecutive frames from a random start, and (c) multi-scale sampling (8 frames at 1 FPS + 8 frames at 8 FPS centered on the midpoint). Classify each clip using a pre-trained VideoMAE model and compare the predictions. Which sampling strategy best captures the activity?

  2. 3D convolution vs. 2D+temporal: Implement both a full 3D convolutional block ($3 \times 3 \times 3$) and a factorized R(2+1)D block ($1 \times 3 \times 3$ then $3 \times 1 \times 1$) with the same number of output channels. Compare the parameter counts, FLOPs, and output shapes for an input of size $[1, 64, 16, 56, 56]$.

  3. Attention pattern visualization: Using a pre-trained TimeSformer, extract and visualize the temporal attention maps for a video clip. For a specific patch position in the middle frame, plot which frames it attends to most strongly. Does the temporal attention pattern differ between action-relevant and static background patches?

  4. Video classification pipeline: Build a complete video classification pipeline for a custom dataset using VideoMAE. Implement data loading with decord, temporal sampling, spatial augmentation (ensuring consistency across frames), and training with the HuggingFace Trainer. Report top-1 and top-5 accuracy on UCF-101.

  5. Optical flow analysis: Compute RAFT optical flow between consecutive frame pairs for a 5-second video. Compute the average flow magnitude per frame and plot it over time. Can you identify the moments of fastest motion? How does this correlate with semantically meaningful events in the video?

30.12 Summary

Video understanding and generation represent some of the most challenging and rapidly advancing areas of AI. The computational demands of processing spatiotemporal data make video AI a frontier that pushes the limits of both algorithms and hardware. The key ideas covered in this chapter are:

  1. Video representation: Videos are 5D tensors $(B, C, T, H, W)$ that encode both spatial appearance and temporal dynamics. Temporal sampling strategies balance coverage and computational cost.

  2. 3D convolutions extend spatial convolutions to the temporal dimension, with I3D's inflation technique enabling transfer from pre-trained image models. Factorized designs (R(2+1)D) and multi-pathway designs (SlowFast) improve efficiency.

  3. Video transformers (TimeSformer, ViViT, Video Swin) apply attention to spatiotemporal tokens, with factorized attention schemes (divided space-time, factorized encoder) making computation tractable.

  4. Video classification (action recognition) is evaluated on benchmarks like Kinetics-400, with modern transformers achieving over 80% top-1 accuracy through large-scale pre-training.

  5. Temporal action detection localizes actions within untrimmed videos, with transformer-based approaches like ActionFormer achieving strong results.

  6. Video captioning and video-language models extend image-language architectures to the temporal domain, enabling natural language descriptions and conversations about video content.

  7. Video generation has been transformed by diffusion models, with architectures that add temporal attention to image diffusion models achieving remarkable temporal consistency.

  8. Optical flow provides explicit motion estimation and remains useful for video analysis, with RAFT representing the current state of the art in deep optical flow estimation.

The field continues to evolve rapidly, with video foundation models, long-form video understanding, and high-quality video generation representing the current frontier.

Practical recommendations for getting started with video AI:

  1. For action recognition: Start with a pre-trained VideoMAE or Video Swin model from HuggingFace and fine-tune it on your dataset using the recipe described in Section 30.4.5.

  2. For video retrieval: Extract per-frame CLIP features and apply mean pooling for a simple but effective baseline (a minimal sketch follows this list). CLIP4Clip provides a more sophisticated temporal-aware alternative.

  3. For video captioning: Use Video-LLaVA or a similar video-language model. For custom domains, fine-tune the model on domain-specific video-caption pairs.

  4. For video generation: Use Stable Video Diffusion for image-to-video, or AnimateDiff for text-to-video with controllable motion.

  5. For deployment: Always consider the computational budget carefully. Video AI is 10-100x more expensive than image AI for the same model architecture due to the temporal dimension.
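
As a starting point for recommendation 2, here is a hedged sketch of the mean-pooled CLIP retrieval baseline using Hugging Face's CLIPModel. The checkpoint name and the choice of 8-16 uniformly sampled frames are assumptions; CLIP4Clip builds temporal modeling on top of this same idea.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# "openai/clip-vit-base-patch32" is a standard public checkpoint; any CLIP variant works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_video(frames):
    """frames: list of PIL images uniformly sampled from the video (e.g., 8-16 frames)."""
    inputs = processor(images=frames, return_tensors="pt")
    frame_feats = model.get_image_features(**inputs)   # [T, D] per-frame embeddings
    video_feat = frame_feats.mean(dim=0)                # mean pooling over time
    return video_feat / video_feat.norm()               # L2-normalize for cosine similarity

@torch.no_grad()
def encode_text(query):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_feat = model.get_text_features(**inputs)[0]
    return text_feat / text_feat.norm()

# Retrieval: rank videos by cosine similarity to the query embedding, e.g.
# video_feats = torch.stack([encode_video(f) for f in all_videos])   # [N, D]
# scores = video_feats @ encode_text("a dog catching a frisbee")     # [N]
# best = scores.argmax()
```

Because both embeddings are L2-normalized, the dot product is a cosine similarity, so ranking an entire gallery reduces to one matrix-vector product over precomputed video features.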

The techniques, architectures, and principles covered in this chapter — from 3D convolutions through video transformers to diffusion-based generation — provide the foundation for understanding the rapidly evolving landscape of video AI. As models continue to scale and new architectures emerge, the core concepts of temporal modeling, efficient spatiotemporal attention, and cross-modal alignment will remain central to progress in this exciting field.


References

  • Tran, D., Bourdev, L., et al. (2015). Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV 2015.
  • Carreira, J. & Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CVPR 2017.
  • Tran, D., Wang, H., et al. (2018). A Closer Look at Spatiotemporal Convolutions for Action Recognition. CVPR 2018.
  • Feichtenhofer, C., Fan, H., et al. (2019). SlowFast Networks for Video Recognition. ICCV 2019.
  • Bertasius, G., Wang, H., & Torresani, L. (2021). Is Space-Time Attention All You Need for Video Understanding? ICML 2021.
  • Arnab, A., Dehghani, M., et al. (2021). ViViT: A Video Vision Transformer. ICCV 2021.
  • Liu, Z., Ning, J., et al. (2022). Video Swin Transformer. CVPR 2022.
  • Teed, Z. & Deng, J. (2020). RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. ECCV 2020.
  • Ho, J., Salimans, T., et al. (2022). Video Diffusion Models. NeurIPS 2022.
  • Tong, Z., Song, Y., et al. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.
  • Grauman, K., Westbury, A., et al. (2022). Ego4D: Around the World in 3,000 Hours of Egocentric Video. CVPR 2022.
  • Zhang, C., Wu, J., & Li, Y. (2022). ActionFormer: Localizing Moments of Actions with Transformers. ECCV 2022.
  • Simonyan, K. & Zisserman, A. (2014). Two-Stream Convolutional Networks for Action Recognition in Videos. NeurIPS 2014.
  • Xu, H., Ghosh, G., et al. (2021). VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. EMNLP 2021.
  • Feichtenhofer, C. (2020). X3D: Expanding Architectures for Efficient Video Recognition. CVPR 2020.