Chapter 30: Exercises — Video Understanding and Generation
Conceptual Exercises
Exercise 1: Video Data Dimensions
Consider a 10-second video at 1080p resolution (1920x1080), 30 FPS, with 3 color channels. (a) Calculate the total number of values in the raw video tensor. (b) If each value is stored as a float32, how much memory does this require? (c) Why is temporal sampling necessary in practice for deep learning models?
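A quick sanity check for parts (a) and (b), written as a few lines of Python; the constants come straight from the prompt:

```python
# Sanity check for Exercise 1 (a) and (b).
frames = 10 * 30                # 10 seconds at 30 FPS
values = frames * 1080 * 1920 * 3
bytes_fp32 = values * 4         # float32 = 4 bytes per value

print(f"values: {values:,}")                         # ~1.87 billion
print(f"memory: {bytes_fp32 / 1e9:.2f} GB (decimal)")
print(f"memory: {bytes_fp32 / 2**30:.2f} GiB (binary)")
```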
Exercise 2: 3D Convolution Parameters
A 3D convolutional layer has 64 input channels, 128 output channels, and kernel size $3 \times 3 \times 3$. (a) How many parameters does it have (including bias)? (b) Compare this with the equivalent 2D convolution ($3 \times 3$). (c) How does R(2+1)D factorization reduce the parameter count?
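Parts (a) and (b) can be verified directly by counting parameters in PyTorch; part (c) can be checked against the factorized block sketched under Exercise 13:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

conv3d = nn.Conv3d(64, 128, kernel_size=3, padding=1)   # 3x3x3 kernel
conv2d = nn.Conv2d(64, 128, kernel_size=3, padding=1)   # 3x3 kernel

print("3D conv:", n_params(conv3d))   # 128*64*27 + 128 = 221,312
print("2D conv:", n_params(conv2d))   # 128*64*9  + 128 =  73,856
```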
Exercise 3: I3D Inflation
Explain how I3D inflates a 2D pre-trained kernel $\mathbf{W}_{2D} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times 3 \times 3}$ into a 3D kernel $\mathbf{W}_{3D} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times 3 \times 3 \times 3}$. Why is dividing by $k_T$ important? What would the output be for a static video (identical frames) with and without this normalization?
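A minimal sketch of the inflation step, with an illustrative check on a static clip; function and variable names are placeholders:

```python
import torch
import torch.nn.functional as F

def inflate_2d_kernel(w2d: torch.Tensor, k_t: int = 3) -> torch.Tensor:
    """Inflate a (C_out, C_in, 3, 3) kernel to (C_out, C_in, k_t, 3, 3) as in I3D."""
    # Replicate along a new temporal axis, then divide by k_t so that summing
    # over the k_t identical temporal taps reproduces the original 2D response.
    return w2d.unsqueeze(2).repeat(1, 1, k_t, 1, 1) / k_t

# Check on a "static video" (the same frame repeated T times):
w2d = torch.randn(16, 64, 3, 3)
frame = torch.randn(1, 64, 32, 32)
clip = frame.unsqueeze(2).repeat(1, 1, 8, 1, 1)          # (1, 64, 8, 32, 32)
y2d = F.conv2d(frame, w2d, padding=1)
y3d = F.conv3d(clip, inflate_2d_kernel(w2d), padding=1)
# Away from the temporal borders, the 3D output matches the 2D output per frame.
print(torch.allclose(y2d, y3d[:, :, 4], atol=1e-4))       # True
```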
Exercise 4: TimeSformer Attention Complexity
For a video with $T = 8$ frames and $N = 196$ spatial patches per frame: (a) Compute the token count for full spatiotemporal attention and its complexity. (b) Compute the complexity for divided space-time attention. (c) What is the speedup factor?
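A few lines of Python to check the token and interaction counts; constant factors and the projection cost are ignored:

```python
# Pairwise attention interactions (per layer, per head), ignoring constants.
T, N = 8, 196

full = (T * N) ** 2                    # joint space-time attention over T*N tokens
divided = N * T**2 + T * N**2          # temporal pass + spatial pass
print(f"full:    {full:,}")            # 2,458,624
print(f"divided: {divided:,}")         # 319,872
print(f"speedup: {full / divided:.1f}x")
```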
Exercise 5: SlowFast Design Rationale
Explain the biological inspiration behind SlowFast Networks. Why does the Fast pathway use 1/8 of the channels compared to the Slow pathway? What types of motions does each pathway capture?
Exercise 6: ViViT Tubelet Embedding
Compare ViViT's tubelet embedding (extracting 3D patches of size $2 \times 16 \times 16$) with frame-level patch embedding. (a) How many tokens does each produce for a 16-frame, 224x224 video? (b) What temporal information does the tubelet embedding capture that frame-level patch embedding cannot?
Exercise 7: Video vs. Image Masking Ratio
VideoMAE uses a 90% masking ratio compared to 75% for image MAE. Explain why video requires a higher masking ratio. What property of video makes reconstruction easier even with more masking?
Exercise 8: Temporal Action Detection Metrics
Explain the difference between mAP@0.3, mAP@0.5, and mAP@0.7 in temporal action detection. Why is mAP@0.7 much harder than mAP@0.3? How does this relate to the precision of temporal boundary detection?
Exercise 9: Video Generation Challenges
Explain why generating each frame independently produces poor video quality. What specific artifacts occur? How do temporal attention layers in video diffusion models address these issues?
Exercise 10: Optical Flow Limitations
The brightness constancy assumption $I(x, y, t) = I(x + u, y + v, t + 1)$ fails in several scenarios. List three specific failure cases and explain why each violates the assumption.
Implementation Exercises
Exercise 11: Video Loading and Preprocessing
Write a PyTorch Dataset class that loads videos using torchvision.io.read_video, performs uniform temporal sampling of 16 frames, applies spatial center-cropping to 224x224, and normalizes with ImageNet statistics.
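A minimal sketch of such a Dataset, assuming a list of video file paths and integer labels; error handling and short-video edge cases are omitted:

```python
import torch
from torch.utils.data import Dataset
from torchvision.io import read_video
from torchvision.transforms.functional import center_crop, resize

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

class VideoClipDataset(Dataset):
    """Loads a video, samples 16 frames uniformly, center-crops to 224x224, normalizes."""

    def __init__(self, video_paths, labels, num_frames=16, crop_size=224):
        self.video_paths = video_paths
        self.labels = labels
        self.num_frames = num_frames
        self.crop_size = crop_size

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        # read_video returns (T, H, W, C) uint8 frames plus audio and metadata.
        frames, _, _ = read_video(self.video_paths[idx], pts_unit="sec")

        # Uniform temporal sampling of num_frames indices over the full video.
        t = frames.shape[0]
        indices = torch.linspace(0, t - 1, self.num_frames).long()
        frames = frames[indices]                          # (num_frames, H, W, C)

        # To (T, C, H, W) float in [0, 1]; resize the short side, then center crop.
        clip = frames.permute(0, 3, 1, 2).float() / 255.0
        clip = resize(clip, self.crop_size, antialias=True)
        clip = center_crop(clip, [self.crop_size, self.crop_size])

        # ImageNet normalization, broadcast across the temporal dimension.
        clip = (clip - IMAGENET_MEAN) / IMAGENET_STD
        return clip, self.labels[idx]
```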
Exercise 12: 3D Convolution Module
Implement a 3D residual block with two $3 \times 3 \times 3$ convolutions, batch normalization, ReLU activation, and a skip connection. Verify the output shape matches the input shape.
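One possible layout for the block, with a shape check at the end; the channel count and input size are arbitrary:

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two 3x3x3 convolutions with BatchNorm/ReLU and an identity skip connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

# Shape check: padding=1 with stride 1 preserves (T, H, W).
x = torch.randn(2, 64, 8, 56, 56)        # (B, C, T, H, W)
assert ResidualBlock3D(64)(x).shape == x.shape
```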
Exercise 13: R(2+1)D Factorization
Implement R(2+1)D factorized convolution: a spatial $1 \times 3 \times 3$ convolution followed by a temporal $3 \times 1 \times 1$ convolution with ReLU between them. Compare parameter count and FLOPs with a full $3 \times 3 \times 3$ convolution.
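A sketch of the factorized block; the default intermediate width follows the choice described in the R(2+1)D paper so the parameter counts are roughly comparable, but any `mid` value can be passed:

```python
import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    """Factorized (2+1)D convolution: spatial 1x3x3, ReLU, then temporal 3x1x1."""

    def __init__(self, c_in, c_out, mid=None):
        super().__init__()
        # The R(2+1)D paper chooses the intermediate width so the factorized block
        # has roughly as many parameters as a full 3x3x3 convolution.
        if mid is None:
            mid = (3 * 3 * 3 * c_in * c_out) // (3 * 3 * c_in + 3 * c_out)
        self.spatial = nn.Conv3d(c_in, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid, c_out, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                  # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

def n_params(m):
    return sum(p.numel() for p in m.parameters())

full = nn.Conv3d(64, 128, kernel_size=3, padding=1)
factored = R2Plus1DBlock(64, 128)
print("3x3x3 conv:", n_params(full))
print("R(2+1)D   :", n_params(factored))
```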
Exercise 14: Divided Space-Time Attention
Implement TimeSformer's divided space-time attention for a single transformer layer. Given input tokens of shape $(B, T \times N, D)$, implement temporal attention (attending across frames at the same spatial position) and spatial attention (attending across patches within the same frame).
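A minimal sketch of one divided space-time block built on nn.MultiheadAttention; the class token and the MLP sub-layer of a full TimeSformer block are omitted:

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Temporal attention followed by spatial attention, with pre-norm residuals."""

    def __init__(self, dim, num_heads, num_frames, num_patches):
        super().__init__()
        self.T, self.N = num_frames, num_patches
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                            # x: (B, T*N, D), frame-major order
        B, _, D = x.shape
        T, N = self.T, self.N

        # Temporal attention: each spatial position attends across the T frames.
        xt = x.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt_norm = self.norm1(xt)
        xt = xt + self.temporal_attn(xt_norm, xt_norm, xt_norm)[0]

        # Spatial attention: each frame attends across its N patches.
        xs = xt.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B * T, N, D)
        xs_norm = self.norm2(xs)
        xs = xs + self.spatial_attn(xs_norm, xs_norm, xs_norm)[0]

        return xs.reshape(B, T, N, D).reshape(B, T * N, D)

x = torch.randn(2, 8 * 196, 768)
block = DividedSpaceTimeAttention(dim=768, num_heads=12, num_frames=8, num_patches=196)
assert block(x).shape == x.shape
```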
Exercise 15: Tubelet Embedding
Implement ViViT's tubelet embedding layer using nn.Conv3d with appropriate kernel and stride. Process a batch of videos and verify the output token count.
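A sketch of the embedding layer; setting the stride equal to the kernel size makes the tubelets non-overlapping:

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Projects non-overlapping 3D patches of size (2, 16, 16) to an embedding dim."""

    def __init__(self, in_channels=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        # Kernel size == stride, so each tubelet is embedded exactly once.
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, x):                     # x: (B, C, T, H, W)
        x = self.proj(x)                      # (B, D, T/2, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, num_tokens, D)

video = torch.randn(2, 3, 16, 224, 224)
tokens = TubeletEmbedding()(video)
print(tokens.shape)    # (2, 8 * 14 * 14, 768) = (2, 1568, 768)
```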
Exercise 16: Video Augmentation Pipeline
Implement a video-specific augmentation pipeline that applies consistent spatial transformations (random crop, horizontal flip, color jitter) across all frames in a clip. Verify consistency by visualizing augmented frames.
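One way to enforce consistency is to sample the transformation parameters once per clip and reuse them for every frame; a sketch follows, assuming the clip is larger than the crop size and scaled to [0, 1]:

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as F

class ConsistentVideoAugment:
    """Samples augmentation parameters once per clip and applies them to every frame."""

    def __init__(self, crop_size=224, flip_prob=0.5, jitter=0.2):
        self.crop_size = crop_size
        self.flip_prob = flip_prob
        self.jitter = transforms.ColorJitter(brightness=jitter, contrast=jitter, saturation=jitter)

    def __call__(self, clip):                 # clip: (T, C, H, W), float in [0, 1]
        # Random crop: sample one crop window and reuse it for all frames.
        i, j, h, w = transforms.RandomCrop.get_params(
            clip[0], output_size=(self.crop_size, self.crop_size))
        clip = torch.stack([F.crop(frame, i, j, h, w) for frame in clip])

        # Horizontal flip: one coin flip for the whole clip.
        if torch.rand(1).item() < self.flip_prob:
            clip = torch.flip(clip, dims=[-1])

        # Color jitter: applying ColorJitter to the stacked (T, C, H, W) tensor
        # samples the jitter once and applies it to all frames.
        return self.jitter(clip)
```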
Exercise 17: Multi-Clip Evaluation
Implement the multi-clip, multi-crop evaluation protocol: sample $K = 10$ clips from a video, apply 3 spatial crops to each, average all $K \times 3$ predictions, and output the final class.
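A sketch of the evaluation loop, assuming a model that takes (B, C, T, H, W) clips and a video whose height already equals the crop size:

```python
import torch

@torch.no_grad()
def multi_clip_predict(model, video, num_clips=10, num_frames=16, crop_size=224):
    """Average predictions over num_clips temporal clips x 3 spatial crops.

    video: (C, T, H, W), already normalized, with H == crop_size and W >= crop_size
    so that left / center / right crops are available.
    """
    C, T, H, W = video.shape
    logits = []

    # Evenly spaced temporal clip start positions.
    starts = torch.linspace(0, max(T - num_frames, 0), num_clips).long().tolist()
    for s in starts:
        clip = video[:, s:s + num_frames]                        # (C, num_frames, H, W)
        # Three spatial crops along the wider axis: left, center, right.
        for off in [0, (W - crop_size) // 2, W - crop_size]:
            crop = clip[:, :, :, off:off + crop_size]
            logits.append(model(crop.unsqueeze(0)))              # (1, num_classes)

    probs = torch.softmax(torch.cat(logits, dim=0), dim=-1).mean(dim=0)
    return probs.argmax().item()
```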
Exercise 18: Optical Flow with RAFT
Use the pre-trained RAFT model from torchvision to compute optical flow between consecutive frames. Visualize the flow using a color wheel encoding (hue = direction, saturation = magnitude).
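A sketch using the RAFT weights shipped with recent torchvision releases; the file name is a placeholder, and frames are resized so height and width are divisible by 8 as RAFT requires:

```python
import torch
from torchvision.io import read_video
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights
from torchvision.transforms.functional import resize
from torchvision.utils import flow_to_image

# Load frames and build (frame_t, frame_t+1) pairs.
frames, _, _ = read_video("video.mp4", pts_unit="sec")       # (T, H, W, C) uint8
frames = frames.permute(0, 3, 1, 2)                          # (T, C, H, W)
frames = resize(frames, [360, 640], antialias=False)         # divisible by 8

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()

img1 = frames[:-1][:8]     # first 8 consecutive pairs as a small demo batch
img2 = frames[1:][:8]
img1, img2 = preprocess(img1, img2)

with torch.no_grad():
    # RAFT returns a list of flow estimates (one per refinement iteration);
    # the last element is the final prediction, shaped (B, 2, H, W).
    flows = model(img1, img2)[-1]

# Color-wheel visualization: hue encodes direction, intensity encodes magnitude.
flow_images = flow_to_image(flows)                           # (B, 3, H, W) uint8
```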
Exercise 19: Temporal Pooling Strategies
Implement and compare three temporal pooling strategies: (a) average pooling across time, (b) max pooling across time, (c) temporal attention pooling with learnable queries. Measure performance differences on a small classification task.
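The two parameter-free poolings are one-liners; a sketch of the attention pooling with a single learnable query is shown alongside them below:

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Pools (B, T, D) frame features into (B, D) using a learnable query."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                # x: (B, T, D)
        q = self.query.expand(x.shape[0], -1, -1)        # one query per clip
        pooled, _ = self.attn(q, x, x)                   # (B, 1, D)
        return pooled.squeeze(1)

features = torch.randn(4, 16, 512)                       # (B, T, D) frame features
avg_pool = features.mean(dim=1)                          # (a) average over time
max_pool = features.max(dim=1).values                    # (b) max over time
attn_pool = TemporalAttentionPool(512)(features)         # (c) attention pooling
print(avg_pool.shape, max_pool.shape, attn_pool.shape)   # all (4, 512)
```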
Exercise 20: Video Feature Extraction
Extract per-frame CLIP features for a set of videos and implement three temporal aggregation methods: mean pooling, temporal transformer, and LSTM. Compare the resulting video representations for retrieval.
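A sketch of the three aggregation heads operating on precomputed per-frame features; here random tensors stand in for CLIP embeddings (e.g. from model.encode_image):

```python
import torch
import torch.nn as nn

class TemporalTransformerAgg(nn.Module):
    """Aggregates (B, T, D) features with a small transformer encoder, then mean-pools."""

    def __init__(self, dim, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):
        return self.encoder(x).mean(dim=1)

class LSTMAgg(nn.Module):
    """Aggregates (B, T, D) features with an LSTM, keeping the final hidden state."""

    def __init__(self, dim):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x):
        _, (h, _) = self.lstm(x)
        return h[-1]

# features: per-frame CLIP embeddings, shape (B, T, 512); random here for illustration.
features = torch.randn(8, 16, 512)
mean_video = features.mean(dim=1)
transformer_video = TemporalTransformerAgg(512)(features)
lstm_video = LSTMAgg(512)(features)
```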
Applied Exercises
Exercise 21: Action Recognition on UCF-101
Fine-tune a pre-trained Video Swin Transformer on UCF-101. Report top-1 and top-5 accuracy. Compare with a baseline that averages per-frame CLIP features.
Exercise 22: Video Classification Pipeline
Build a complete video classification pipeline: data loading, augmentation, model training with mixed precision, multi-clip evaluation, and confusion matrix visualization. Target 85%+ accuracy on UCF-101.
Exercise 23: Video-Text Retrieval
Implement a video-text retrieval system using per-frame CLIP features with temporal aggregation. Evaluate on a subset of MSR-VTT using Recall@1, Recall@5, and Recall@10.
Exercise 24: Video Summarization
Build a simple video summarization system that: (a) extracts features for each frame, (b) clusters frames by visual similarity, (c) selects representative frames from each cluster, and (d) generates captions for selected frames.
Exercise 25: Activity Detection
Implement a simple temporal action detection system: extract features with a sliding window, classify each window, and apply non-maximum suppression to produce temporal segments. Evaluate on a subset of ActivityNet.
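One building block you will need is non-maximum suppression over 1D temporal segments; a minimal pure-Python sketch:

```python
def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over 1D temporal segments.

    segments: list of (start, end) tuples in seconds; scores: matching confidences.
    Returns the indices of the segments that are kept.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        remaining = []
        for i in order:
            s1, e1 = segments[best]
            s2, e2 = segments[i]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            iou = inter / union if union > 0 else 0.0
            if iou < iou_threshold:        # keep only segments that overlap little
                remaining.append(i)
        order = remaining
    return keep
```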
Challenge Exercises
Exercise 26: Video Transformer from Scratch
Implement a complete video vision transformer with tubelet embedding, divided space-time attention, and classification head. Train on Kinetics-400 (or a subset) and report accuracy.
Exercise 27: SlowFast Implementation
Implement the SlowFast architecture with lateral connections between the two pathways. Train on UCF-101 and compare with single-pathway baselines.
Exercise 28: Video Diffusion
Extend a pre-trained image diffusion model with temporal attention layers for video generation. Generate 16-frame clips conditioned on text prompts and evaluate temporal consistency.
Exercise 29: Long-Form Video Understanding
Build a hierarchical video understanding system: extract clip-level features, aggregate with a temporal transformer, and classify hour-long videos. Test on a dataset of movie genres or lecture topics.
Exercise 30: Video Object Tracking
Implement a simple video object tracking system using CLIP features: given a bounding box in the first frame, track the object through subsequent frames using feature matching. Compare with correlation-based tracking.