Chapter 30: Quiz — Video Understanding and Generation
Test your understanding of video understanding and generation with these questions. Try to answer each question before revealing the solution.
Question 1
Why can't we simply apply a standard 2D image classifier independently to each frame and average predictions for video classification?
Show Answer
While per-frame classification with averaging works as a baseline, it fundamentally cannot capture temporal dynamics. Many video understanding tasks require reasoning about how things change over time: (1) Actions like "picking up" vs. "putting down" may look identical in any single frame but differ in temporal order. (2) Activities like "running" require observing motion across frames. (3) Interactions between objects (e.g., "catching a ball") are defined by temporal relationships. (4) The Something-Something v2 dataset specifically tests this — models that process frames independently achieve less than 30% accuracy on actions like "pushing something from left to right" vs. "pushing something from right to left." Temporal modeling through 3D convolutions, temporal attention, or optical flow is essential for capturing these dynamics.
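For reference, a minimal sketch of that per-frame baseline, assuming a pre-trained torchvision classifier as a stand-in for any 2D model; because the logits are averaged over frames, reversing the frame order leaves the prediction unchanged, which is exactly the limitation described above.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

# Hypothetical input: a batch of 2 videos with 8 RGB frames each, (B, T, C, H, W).
video = torch.randn(2, 8, 3, 224, 224)

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()

with torch.no_grad():
    B, T, C, H, W = video.shape
    frames = video.reshape(B * T, C, H, W)       # treat every frame as an independent image
    logits = model(frames).reshape(B, T, -1)     # (B, T, 1000) per-frame class scores
    video_logits = logits.mean(dim=1)            # averaging discards temporal order

pred = video_logits.argmax(dim=-1)               # identical for the reversed clip
```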
Question 2
How do 3D convolutions extend 2D convolutions to model temporal information?
Show Answer
3D convolutions use kernels of size $k_T \times k_H \times k_W$ that slide over both spatial and temporal dimensions simultaneously. For an input of shape $(C_{\text{in}}, T, H, W)$, a 3D kernel at position $(t, h, w)$ computes a weighted sum over a local spatiotemporal neighborhood spanning $k_T$ frames, $k_H$ pixels vertically, and $k_W$ pixels horizontally. A common kernel size is $3 \times 3 \times 3$, which captures local motion patterns within a 3-frame window. Stacking multiple 3D convolutional layers builds up increasingly large temporal receptive fields, allowing the network to model longer-range temporal dependencies. The key difference from 2D convolutions is that the kernel's temporal dimension enables learning motion-specific filters (e.g., edge detectors that activate on edges moving in a particular direction).
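A minimal sketch of the shapes involved, using PyTorch's `nn.Conv3d`; the clip length, resolution, and channel counts are arbitrary illustration values.

```python
import torch
import torch.nn as nn

# Input clip: (batch, C_in, T, H, W)
clip = torch.randn(1, 3, 16, 112, 112)

# A 3x3x3 kernel slides jointly over time, height, and width.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=(1, 1, 1))

out = conv3d(clip)
print(out.shape)  # torch.Size([1, 64, 16, 112, 112]): padding preserves the temporal length

# Stacking layers grows the temporal receptive field: two 3x3x3 layers see a 5-frame window.
```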
Question 3
What is the I3D inflation technique, and why is it beneficial for video model training?
Show Answer
I3D inflation converts pre-trained 2D CNN weights into 3D by replicating each 2D kernel $k_T$ times along the temporal dimension and dividing by $k_T$: $\mathbf{W}_{3D}[:,:,\tau,:,:] = \mathbf{W}_{2D}/k_T$ for all $\tau$. The division ensures that when applied to a static video (identical frames), the inflated 3D network produces the same output as the original 2D network. This is beneficial because: (1) It transfers rich spatial representations learned from millions of ImageNet images, providing a strong initialization for the spatial processing part of the video model. (2) It dramatically reduces the data and compute needed to train video models — training from scratch on video datasets is much harder due to data scarcity. (3) The temporal filters start from a reasonable initialization (averaging over time) and quickly learn motion-specific patterns during fine-tuning.
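A minimal sketch of the inflation rule, assuming 2D convolution weights of shape $(C_{\text{out}}, C_{\text{in}}, k_H, k_W)$; replicating them along a new temporal axis and dividing by $k_T$ preserves the 2D network's response on a clip of identical frames.

```python
import torch

def inflate_2d_to_3d(w2d: torch.Tensor, k_t: int) -> torch.Tensor:
    """Inflate (C_out, C_in, kH, kW) weights to (C_out, C_in, k_t, kH, kW)."""
    # Replicate along a new temporal axis and rescale so that a video of
    # identical frames produces the same activations as the 2D network.
    return w2d.unsqueeze(2).repeat(1, 1, k_t, 1, 1) / k_t

w2d = torch.randn(64, 3, 7, 7)       # e.g., the first conv of a 2D ResNet
w3d = inflate_2d_to_3d(w2d, k_t=5)   # -> (64, 3, 5, 7, 7)

# Sanity check for a "static video": summing the temporal taps recovers the 2D kernel.
assert torch.allclose(w3d.sum(dim=2), w2d, atol=1e-6)
```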
Question 4
Explain the divided space-time attention in TimeSformer and why it is preferred over joint spatiotemporal attention.
Show Answer
Divided space-time attention processes temporal and spatial attention separately within each transformer layer. First, temporal attention: each token at position $(t, i)$ attends to the same spatial position across all frames $\{(1,i), (2,i), \ldots, (T,i)\}$. Then, spatial attention: each token attends to all spatial positions within its frame $\{(t,1), (t,2), \ldots, (t,N)\}$. This is preferred because: (1) **Computational cost**: Joint attention has complexity $O((TN)^2 d)$, while divided attention has $O(TN(T+N)d)$. For $T=8, N=196$, this is $\sim$2.5M vs. $\sim$320K attention score computations per layer, roughly an 8x reduction. (2) **Performance**: Despite the factorization, divided attention achieves comparable or better accuracy than joint attention, likely because the separate operations allow the model to specialize temporal and spatial processing. (3) **Memory**: Joint attention requires materializing a $(TN) \times (TN)$ attention matrix, which can exceed GPU memory for longer videos.
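A minimal sketch of the factorization, assuming patch tokens shaped $(B, T, N, D)$ and using `nn.MultiheadAttention`; residual connections, layer norms, and the classification token are omitted.

```python
import torch
import torch.nn as nn

B, T, N, D = 2, 8, 196, 256
x = torch.randn(B, T, N, D)

temporal_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
spatial_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# Temporal attention: each spatial position attends across the T frames.
xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)     # sequences of length T
xt, _ = temporal_attn(xt, xt, xt)
x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)

# Spatial attention: each frame attends across its N patches.
xs = x.reshape(B * T, N, D)                         # sequences of length N
xs, _ = spatial_attn(xs, xs, xs)
x = xs.reshape(B, T, N, D)

# Cost per layer: O(N*T^2 + T*N^2) score computations instead of O((T*N)^2) for joint attention.
```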
Question 5
How do SlowFast Networks model both fast and slow temporal dynamics?
Show Answer
SlowFast Networks use two parallel pathways: (1) The **Slow pathway** operates at a low frame rate (e.g., 4 FPS, sampling every 16th frame) with a high channel capacity, capturing spatial semantics and slowly changing information like scene appearance and object identity. (2) The **Fast pathway** operates at a high frame rate (e.g., 32 FPS, sampling every 2nd frame) with a low channel capacity (1/8 of Slow), capturing rapid temporal changes like fine-grained motion and quick actions. Lateral connections (using temporal strided convolutions) transfer information from the Fast to the Slow pathway at each resolution level, allowing the Slow pathway to incorporate motion information. This design is inspired by the primate visual system's magnocellular (fast, motion-sensitive) and parvocellular (slow, detail-sensitive) pathways. The asymmetric channel allocation makes the Fast pathway lightweight (~20% of total compute).
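A minimal sketch of the two-pathway sampling and one lateral fusion step, using the strides and channel ratio quoted above (every 16th frame for Slow, every 2nd for Fast, Fast channels at 1/8 of Slow); the stem and fusion layers are simplified stand-ins rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 64, 224, 224)         # (B, C, T, H, W): 64 raw frames

alpha = 8                                        # Fast samples 8x more frames than Slow
slow_in = video[:, :, ::16]                      # every 16th frame -> T = 4
fast_in = video[:, :, ::2]                       # every 2nd frame  -> T = 32

slow_channels, fast_channels = 64, 64 // 8       # Fast uses 1/8 of the Slow channel width
slow_stem = nn.Conv3d(3, slow_channels, kernel_size=(1, 7, 7), padding=(0, 3, 3))
fast_stem = nn.Conv3d(3, fast_channels, kernel_size=(5, 7, 7), padding=(2, 3, 3))

# Lateral connection: a time-strided conv brings Fast features to the Slow frame rate.
lateral = nn.Conv3d(fast_channels, 2 * fast_channels,
                    kernel_size=(5, 1, 1), stride=(alpha, 1, 1), padding=(2, 0, 0))

slow = slow_stem(slow_in)                        # (1, 64, 4, 224, 224)
fast = fast_stem(fast_in)                        # (1,  8, 32, 224, 224)
fused = torch.cat([slow, lateral(fast)], dim=1)  # (1, 80, 4, 224, 224): Slow now sees motion cues
```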
Question 6
What is the difference between uniform and dense temporal sampling, and when would you choose each?
Show Answer
**Uniform sampling** selects $T$ frames evenly spaced throughout the entire video. For a video with $N$ frames and $T$ desired frames, it samples at indices $\{0, \lfloor N/T \rfloor, 2\lfloor N/T \rfloor, \ldots\}$. This provides broad temporal coverage but may miss fast actions that occur between sampled frames. Choose this when videos contain long-duration activities or when you need to understand the overall narrative (e.g., video summarization, movie genre classification). **Dense sampling** extracts a clip of $T$ consecutive frames from a random starting point. This captures fine-grained temporal dynamics within a short window but has limited temporal coverage. Choose this when the task requires understanding short-duration actions with precise timing (e.g., Something-Something v2, sports action recognition, gesture recognition). In practice, dense sampling is used during training (with random start positions for augmentation), and multi-clip uniform sampling is used during evaluation to cover the full video.
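A minimal sketch of the two strategies as frame-index generators; the function names and the optional stride argument for dense sampling are illustrative choices.

```python
import random

def uniform_sample(num_frames: int, t: int) -> list[int]:
    """Pick t frame indices evenly spaced over the whole video."""
    stride = num_frames / t
    return [min(int(i * stride), num_frames - 1) for i in range(t)]

def dense_sample(num_frames: int, t: int, stride: int = 1) -> list[int]:
    """Pick t consecutive (optionally strided) frames from a random start."""
    span = (t - 1) * stride + 1
    start = random.randint(0, max(num_frames - span, 0))
    return [min(start + i * stride, num_frames - 1) for i in range(t)]

print(uniform_sample(300, 8))            # [0, 37, 75, 112, 150, 187, 225, 262]
print(dense_sample(300, 8, stride=2))    # e.g. [120, 122, 124, 126, 128, 130, 132, 134]
```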
Question 7
How does Video Swin Transformer extend the shifted window mechanism to 3D?
Show Answer
Video Swin Transformer extends 2D shifted windows to 3D by defining windows of size $T_w \times M \times M$ (e.g., $2 \times 7 \times 7$) in the spatiotemporal volume. Self-attention is computed within each 3D window, giving complexity $O(T_w \cdot M^2)$ per token (linear in total tokens). To enable cross-window information flow, the 3D window grid is shifted by $(T_w/2, M/2, M/2)$ in alternating layers, analogous to the 2D shift in Swin Transformer. Cyclic shifting is applied along all three dimensions with appropriate attention masking. The 3D relative position bias is extended to capture relative positions in time, height, and width: the bias table has size $(2T_w-1) \times (2M-1) \times (2M-1)$, providing a 3D position-aware attention mechanism.
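A minimal sketch of 3D window partitioning and the half-window cyclic shift via `torch.roll`, assuming the token grid divides evenly into $2 \times 7 \times 7$ windows; the attention masking needed for the shifted configuration is omitted.

```python
import torch

def window_partition_3d(x, window):
    """Split a (B, T, H, W, C) token grid into (num_windows*B, Tw*M*M, C) groups."""
    B, T, H, W, C = x.shape
    Tw, M, _ = window
    x = x.view(B, T // Tw, Tw, H // M, M, W // M, M, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, Tw * M * M, C)
    return x

x = torch.randn(1, 8, 14, 14, 96)             # (B, T, H, W, C) token grid
windows = window_partition_3d(x, (2, 7, 7))   # self-attention runs inside each window
print(windows.shape)                          # torch.Size([16, 98, 96])

# Alternating layers cyclically shift the grid by half a window along T, H, W,
# so tokens near window borders mix with their neighbours in the next layer.
shifted = torch.roll(x, shifts=(-1, -3, -3), dims=(1, 2, 3))
```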
Question 8
What is temporal action detection, and how does it differ from action recognition?
Show Answer
**Action recognition** classifies a pre-segmented video clip into one of several action categories. The temporal boundaries are given — the model only needs to predict *what* action is happening. **Temporal action detection** operates on long, untrimmed videos and must identify both *what* actions occur and *when* they occur by predicting temporal segments $(t_{\text{start}}, t_{\text{end}}, \text{class}, \text{confidence})$ for each action instance. It is analogous to object detection in images (where you must localize objects, not just classify the whole image). Challenges unique to temporal action detection include: (1) Highly variable action durations (from 1 second to several minutes). (2) Background segments between actions that must be classified as "no action." (3) Overlapping actions (multiple actions happening simultaneously). (4) Precise boundary detection is difficult because actions often have gradual transitions.
Question 9
Why is RAFT considered the state-of-the-art approach for optical flow estimation?
Show Answer
RAFT (Recurrent All-Pairs Field Transforms) excels because of three key design choices: (1) **All-pairs correlation volume**: RAFT computes dot products between all pairs of pixels from two frames, creating a 4D volume that encapsulates all possible correspondences. This provides rich matching information without requiring pre-defined search windows. (2) **Iterative refinement**: A GRU-based update operator iteratively refines the flow estimate by looking up the correlation volume at the current predicted correspondence location. This iterative approach can handle large motions and progressively sharpen estimates. (3) **Multi-scale correlation**: The correlation volume is pooled into a multi-scale pyramid, enabling both coarse matching of large displacements and fine matching of small ones. RAFT achieves state-of-the-art results on Sintel and KITTI benchmarks, runs efficiently on GPUs, and generalizes well across datasets — making it the practical default for optical flow estimation.
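A minimal sketch of the all-pairs correlation volume at a single scale, assuming $D$-dimensional feature maps for the two frames; the pooled correlation pyramid, lookup operator, and GRU update are only indicated in comments.

```python
import torch

D, H, W = 256, 46, 62                  # feature dimension and a 1/8-resolution grid
f1 = torch.randn(1, D, H, W)           # features of frame 1
f2 = torch.randn(1, D, H, W)           # features of frame 2

# Dot product between every pixel of frame 1 and every pixel of frame 2.
corr = torch.einsum("bdij,bdkl->bijkl", f1, f2) / D ** 0.5
print(corr.shape)                      # torch.Size([1, 46, 62, 46, 62]): a 4D volume per item

# RAFT pools the last two dimensions into a pyramid, then its GRU-based update
# operator repeatedly "looks up" this volume around the current flow estimate
# and uses the result to refine the flow field.
```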
Question 10
How do video diffusion models maintain temporal consistency during generation?
Show Answer
Video diffusion models add temporal attention and temporal convolution layers to the image diffusion architecture (typically a U-Net or transformer). At each resolution level, the network alternates between: (1) **Spatial processing**: Each frame is processed independently through spatial convolutions and self-attention, generating per-frame features. (2) **Temporal processing**: Temporal attention layers attend across frames at each spatial position, ensuring that features at the same location are consistent over time. Temporal convolutions apply 1D convolutions along the time axis. The forward process adds noise independently to each frame, but the reverse (denoising) process jointly denoises all frames, with temporal attention providing cross-frame communication. Additionally, some models use temporal coherence losses or train on real video data where consecutive frames are naturally coherent. The result is that generated frames share consistent global structure, lighting, color palettes, and object positions.
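A minimal sketch of such a temporal attention layer, assuming intermediate features shaped $(B, C, T, H, W)$; spatial positions are folded into the batch so attention runs only across the $T$ frames at each location.

```python
import torch
import torch.nn as nn

B, C, T, H, W = 1, 64, 8, 32, 32
feat = torch.randn(B, C, T, H, W)

temporal_attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)

# Attend across frames independently at every spatial location (h, w).
x = feat.permute(0, 3, 4, 2, 1).reshape(B * H * W, T, C)   # sequences of length T
out, _ = temporal_attn(x, x, x)
out = out.reshape(B, H, W, T, C).permute(0, 4, 3, 1, 2)    # back to (B, C, T, H, W)

feat = feat + out   # residual: temporal layers are typically added onto a spatial backbone
```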
Question 11
What are the advantages and disadvantages of processing video at the feature level (per-frame image features + temporal aggregation) versus end-to-end spatiotemporal processing?
Show Answer
**Feature-level processing** (per-frame CLIP/ViT features + temporal transformer/pooling):
- Advantages: (1) Leverages powerful pre-trained image models. (2) Computationally efficient: feature extraction is parallelizable across frames. (3) Flexible: can easily swap temporal aggregation methods. (4) Works well for tasks dominated by appearance (scene recognition, video retrieval).
- Disadvantages: (1) Each frame is processed independently, so spatial features are unaware of temporal context. (2) Fine-grained motion information may be lost in the feature extraction step. (3) Two-stage training is suboptimal compared to end-to-end learning.

**End-to-end spatiotemporal processing** (3D CNNs, video transformers):
- Advantages: (1) Joint spatiotemporal features can capture motion-dependent spatial patterns. (2) End-to-end training optimizes the full pipeline. (3) Better for motion-critical tasks (Something-Something, sports analysis).
- Disadvantages: (1) Much higher computational cost. (2) Requires large-scale video pre-training data. (3) Harder to initialize from image models.
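A minimal sketch of the feature-level route, using a torchvision ResNet with its classification head removed as a hypothetical stand-in for a CLIP/ViT encoder, followed by a small temporal transformer; dimensions and layer counts are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()               # keep the 2048-d per-frame features
backbone.eval()

video = torch.randn(2, 16, 3, 224, 224)   # (B, T, C, H, W)
B, T = video.shape[:2]

with torch.no_grad():
    frame_feats = backbone(video.flatten(0, 1)).reshape(B, T, -1)   # (B, T, 2048)

# Temporal aggregation: a small transformer encoder over the frame sequence.
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True),
    num_layers=2,
)
clip_feat = temporal(frame_feats).mean(dim=1)   # (B, 2048) video-level representation
```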
Question 12
How does VideoMAE adapt the masked autoencoder framework for video, and why does it use a higher masking ratio than image MAE?
Show Answer
VideoMAE extends MAE to video by: (1) Using tubelet embeddings to tokenize video into spatiotemporal patches. (2) Randomly masking 90% of tubes (the same spatial positions are masked across all frames of the clip). (3) The encoder processes only the 10% visible tubes, and the decoder reconstructs all masked tubes from the encoded visible tokens plus mask tokens. The 90% masking ratio (vs. 75% for images) is justified by video's high temporal redundancy: consecutive frames are very similar, so even with 90% masking, sufficient information leaks through the visible tubes to make reconstruction possible. If only 75% were masked, the temporal redundancy would make the pre-training task too easy — the model could reconstruct masked patches by simply copying from nearby frames in the visible set, learning a trivial copy operation rather than deep visual understanding. The higher masking ratio forces the model to learn meaningful spatiotemporal representations and also provides a computational benefit: processing only 10% of tubes makes the encoder much faster.
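A minimal sketch of tube masking at a 90% ratio, assuming tubelet tokens laid out on an $8 \times 14 \times 14$ grid with 768-dimensional embeddings; the same spatial positions are dropped in every temporal slice, and only the visible tokens would be passed to the encoder.

```python
import torch

B = 2
T_, H_, W_ = 8, 14, 14                 # tubelet grid after patch embedding
num_spatial = H_ * W_
mask_ratio = 0.9

# Sample one spatial mask per video and repeat it over time (tube masking).
num_keep = int(num_spatial * (1 - mask_ratio))        # 19 of 196 spatial positions survive
noise = torch.rand(B, num_spatial)
keep_spatial = noise.argsort(dim=1)[:, :num_keep]     # (B, 19) kept positions

tokens = torch.randn(B, T_, num_spatial, 768)         # tubelet embeddings
idx = keep_spatial.unsqueeze(1).unsqueeze(-1).expand(B, T_, num_keep, 768)
visible = tokens.gather(2, idx)                       # (B, 8, 19, 768)

# Only these ~10% of tokens are fed to the encoder; the decoder later reconstructs
# the other 90% from mask tokens plus the encoded visible ones.
encoder_input = visible.flatten(1, 2)                 # (B, 8 * 19, 768)
```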
Question 13
Describe three practical strategies for reducing GPU memory consumption when training video models.