Chapter 30: Further Reading
Foundational Papers
3D Convolutions and Early Video Models
- Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). "Learning Spatiotemporal Features with 3D Convolutional Networks." ICCV 2015. C3D, one of the first successful applications of 3D CNNs to video, finding that $3 \times 3 \times 3$ kernels work best and that its conv features serve as effective generic video descriptors.
- Carreira, J. & Zisserman, A. (2017). "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset." CVPR 2017. Introduces I3D (Inflated 3D ConvNet) and the Kinetics dataset. The inflation technique for transferring ImageNet weights to video models remains widely used.
- Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). "A Closer Look at Spatiotemporal Convolutions for Action Recognition." CVPR 2018. Introduces R(2+1)D factorized convolutions, demonstrating that decomposing 3D convolutions into separate spatial and temporal components improves both efficiency and accuracy (a minimal sketch of the factorization follows this list).
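The R(2+1)D factorization fits in a few lines of PyTorch. The sketch below is illustrative rather than a faithful reproduction of the paper's blocks: the `R2Plus1dBlock` name, its default intermediate width, and the example shapes are assumptions (the paper chooses the intermediate width so the parameter count matches the full 3D convolution).

```python
import torch
import torch.nn as nn

class R2Plus1dBlock(nn.Module):
    """Factorize a t x k x k 3D convolution into a spatial (1 x k x k)
    convolution followed by a temporal (t x 1 x 1) convolution."""
    def __init__(self, in_ch, out_ch, mid_ch=None, k=3, t=3):
        super().__init__()
        # Intermediate width; defaulting to out_ch keeps the sketch simple.
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid_ch)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# A clip of 8 RGB frames at 112 x 112 resolution.
clip = torch.randn(2, 3, 8, 112, 112)
out = R2Plus1dBlock(3, 64)(clip)  # -> (2, 64, 8, 112, 112)
```

The nonlinearity between the spatial and temporal convolutions is part of the point: it doubles the number of nonlinear operations relative to a single 3D convolution with the same budget.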
Multi-Stream and Efficient Architectures
- Simonyan, K. & Zisserman, A. (2014). "Two-Stream Convolutional Networks for Action Recognition in Videos." NeurIPS 2014. The foundational two-stream paper, processing RGB frames and optical flow separately. Established the importance of explicit motion representation for video understanding.
- Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). "SlowFast Networks for Video Recognition." ICCV 2019. Introduces the dual-pathway architecture processing video at two temporal resolutions, a key design insight for later video architectures (the frame sampling is sketched after this list).
- Feichtenhofer, C. (2020). "X3D: Expanding Architectures for Efficient Video Recognition." CVPR 2020. Systematic architecture expansion along multiple axes to find optimal video network designs. Demonstrates principled efficiency improvements.
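The core SlowFast idea, one pathway seeing sparsely sampled frames and another seeing the full frame rate, can be illustrated by the input sampling alone. This is a minimal sketch under assumed names and shapes (`slowfast_sample`, `alpha=4`); the real model also differs in channel width per pathway and fuses them with lateral connections.

```python
import torch

def slowfast_sample(frames, alpha=4):
    """Split one clip into Slow and Fast pathway inputs.

    frames: (batch, channels, time, height, width), where time is the
    Fast pathway's frame count. The Slow pathway keeps every alpha-th
    frame; the Fast pathway keeps them all.
    """
    fast = frames                  # full temporal resolution
    slow = frames[:, :, ::alpha]   # 1/alpha of the frames
    return slow, fast

clip = torch.randn(2, 3, 32, 224, 224)
slow, fast = slowfast_sample(clip)
print(slow.shape, fast.shape)  # (2, 3, 8, 224, 224) and (2, 3, 32, 224, 224)
```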
Video Transformers
- Bertasius, G., Wang, H., & Torresani, L. (2021). "Is Space-Time Attention All You Need for Video Understanding?" ICML 2021. The TimeSformer paper, exploring five spatiotemporal attention schemes. Establishes divided space-time attention as the practical choice for video transformers (a sketch of divided attention over tubelet tokens follows this list).
- Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., & Schmid, C. (2021). "ViViT: A Video Vision Transformer." ICCV 2021. Explores four designs for video vision transformers, including the factorised encoder and tubelet embedding. Comprehensive analysis of design choices.
- Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., & Hu, H. (2022). "Video Swin Transformer." CVPR 2022. Extends Swin Transformer to video with 3D shifted windows, achieving state-of-the-art accuracy with efficient attention.
- Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). "Multiscale Vision Transformers." ICCV 2021. MViT introduces multi-scale attention for video transformers through pooling attention, creating a hierarchical transformer architecture.
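As a concrete illustration of two ideas from this list, the sketch below tokenizes a clip with a ViViT-style tubelet embedding and then applies TimeSformer-style divided space-time attention: temporal attention across frames at each spatial location, followed by spatial attention within each frame. The class names, dimensions, and defaults (`TubeletEmbed`, `DividedSpaceTimeAttention`, a 768-dimensional embedding, 2 x 16 x 16 tubelets) are assumptions for the example; class tokens, positional embeddings, and MLP blocks are omitted.

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Tubelet embedding: a 3D conv whose stride equals the tubelet size
    turns a clip into a (time, space, dim) grid of tokens."""
    def __init__(self, dim=768, in_ch=3, tubelet=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, clip):               # clip: (B, C, T, H, W)
        x = self.proj(clip)                # (B, D, T', H', W')
        return x.flatten(3).permute(0, 2, 3, 1)  # (B, T', H'*W', D)

class DividedSpaceTimeAttention(nn.Module):
    """Temporal attention over frames at each spatial location, followed
    by spatial attention over locations within each frame."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.norm_t, self.norm_s = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (B, T, N, D) patch tokens
        b, t, n, d = x.shape
        # Temporal attention: one length-T sequence per spatial position.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        q = self.norm_t(xt)
        xt = xt + self.attn_t(q, q, q, need_weights=False)[0]
        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
        # Spatial attention: one length-N sequence per frame.
        xs = x.reshape(b * t, n, d)
        q = self.norm_s(xs)
        return (xs + self.attn_s(q, q, q, need_weights=False)[0]).reshape(b, t, n, d)

clip = torch.randn(2, 3, 16, 224, 224)     # 16 frames at 224 x 224
tokens = TubeletEmbed()(clip)              # (2, 8, 196, 768)
out = DividedSpaceTimeAttention()(tokens)  # (2, 8, 196, 768)
```

Compared with joint attention over all T * N tokens, the divided scheme reduces attention cost from O((TN)^2) to O(T^2 N + N^2 T) per layer, which is the efficiency argument both papers make.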
Self-Supervised Video Learning
- Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training." NeurIPS 2022. Applies masked autoencoder pre-training to video with a 90% masking ratio, demonstrating that video's temporal redundancy enables very high masking ratios (the tube-masking scheme is sketched after this list).
- Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., ... & Qiao, Y. (2022). "InternVideo: General Video Foundation Models via Generative and Discriminative Learning." arXiv:2212.03191. Combines masked video modeling with video-text contrastive learning for a versatile video foundation model.
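VideoMAE's tube masking, which masks the same spatial patches in every frame so the model cannot simply copy them from a neighboring frame, is simple to sketch. The function name, shapes, and the 90% default below are illustrative assumptions; the sketch produces only the boolean mask, not the encoder or decoder.

```python
import torch

def tube_mask(batch, n_frames, n_patches, mask_ratio=0.9):
    """Return a boolean mask of shape (batch, n_frames, n_patches) where
    True marks a masked patch. The same spatial positions are masked in
    every frame, forming "tubes" through time."""
    n_keep = int(n_patches * (1 - mask_ratio))
    # Random spatial positions to keep visible, drawn independently per sample.
    keep_idx = torch.rand(batch, n_patches).argsort(dim=1)[:, :n_keep]
    mask = torch.ones(batch, n_patches, dtype=torch.bool)
    mask[torch.arange(batch).unsqueeze(1), keep_idx] = False
    # Broadcast the per-sample spatial mask across all frames.
    return mask.unsqueeze(1).expand(batch, n_frames, n_patches)

m = tube_mask(2, 8, 196)       # 8 frames of 14 x 14 patches
print(m.shape, m[0, 0].sum())  # torch.Size([2, 8, 196]), 177 of 196 masked
```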
Optical Flow
- Teed, Z. & Deng, J. (2020). "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow." ECCV 2020. A landmark in deep optical flow estimation; the all-pairs correlation volume and iterative GRU refinement have become the standard design (the correlation volume is sketched after this list).
- Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., ... & Brox, T. (2015). "FlowNet: Learning Optical Flow with Convolutional Networks." ICCV 2015. The first paper to show that neural networks can estimate optical flow end-to-end. Introduced the FlowNetS and FlowNetC architectures.
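RAFT's all-pairs correlation volume is just the dot product between every pair of feature vectors from the two frames. The sketch below shows that single step under assumed shapes (1/8-resolution feature maps with 256 channels); the multi-scale pooling pyramid, the lookup operator, and the GRU update iterations are omitted.

```python
import torch

def all_pairs_correlation(feat1, feat2):
    """Correlation between every pixel of feat1 and every pixel of feat2.

    feat1, feat2: (B, C, H, W) feature maps -> (B, H, W, H, W) volume.
    """
    b, c, h, w = feat1.shape
    f1 = feat1.reshape(b, c, h * w)
    f2 = feat2.reshape(b, c, h * w)
    corr = torch.einsum("bci,bcj->bij", f1, f2)    # (B, H*W, H*W)
    return corr.reshape(b, h, w, h, w) / c ** 0.5  # scale by sqrt(dim)

f1 = torch.randn(1, 256, 46, 62)   # features at 1/8 of a 368 x 496 image
f2 = torch.randn(1, 256, 46, 62)
vol = all_pairs_correlation(f1, f2)
print(vol.shape)                   # torch.Size([1, 46, 62, 46, 62])
```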
Temporal Action Detection
- Zhang, C., Wu, J., & Li, Y. (2022). "ActionFormer: Localizing Moments of Actions with Transformers." ECCV 2022. Applies transformers to temporal action detection with a multi-scale architecture, achieving state-of-the-art results with a simple, elegant design (the temporal NMS post-processing such detectors share is sketched after this list).
- Lin, T., Liu, X., Li, X., Ding, E., & Wen, S. (2019). "BMN: Boundary-Matching Network for Temporal Action Proposal Generation." ICCV 2019. An influential two-stage approach that generates temporal proposals through boundary probability prediction and boundary-content matching.
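Both families of detectors ultimately produce scored (start, end) segments that must be de-duplicated. Below is a minimal greedy temporal NMS sketch; note that many modern detectors, ActionFormer included, use Soft-NMS, which decays overlapping scores instead of discarding segments. The function names and threshold are assumptions for the example.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) segments on the time axis."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring segment and drop
    remaining segments that overlap it by more than iou_thresh."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [j for j in order
                 if temporal_iou(segments[best], segments[j]) <= iou_thresh]
    return keep

segs = [(1.0, 4.0), (1.2, 4.1), (10.0, 12.0)]
print(temporal_nms(segs, [0.9, 0.8, 0.7]))   # [0, 2]
```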
Video Generation
- Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., & Fleet, D. J. (2022). "Video Diffusion Models." NeurIPS 2022. Extends image diffusion to video by adding temporal attention to the denoising network, establishing the foundation for modern video generation (the pattern of inserting a temporal layer into a frame-wise model is sketched after this list).
- Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., ... & Rombach, R. (2023). "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv:2311.15127. Applies latent diffusion to video generation, building on Stable Diffusion's architecture with temporal extensions.
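A recurring pattern in these models is to keep the spatial layers of an image backbone operating on frames independently (frames folded into the batch dimension) and to insert temporal layers that attend across frames at each spatial location. The sketch below illustrates only that rearrangement, under assumed names and shapes (`TemporalAttentionBlock`, 320 channels, 8 frames); real models also add temporal positional information and often initialize the new temporal layers so that they initially act as an identity.

```python
import torch
import torch.nn as nn

class TemporalAttentionBlock(nn.Module):
    """Self-attention across frames at every spatial location, applied to
    feature maps whose frames are folded into the batch dimension."""
    def __init__(self, channels=320, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x, n_frames):
        # x: (batch * n_frames, channels, height, width) from spatial layers.
        bt, c, h, w = x.shape
        b = bt // n_frames
        # Rearrange so time is the sequence dimension: (B*H*W, T, C).
        xt = (x.reshape(b, n_frames, c, h * w)
               .permute(0, 3, 1, 2)
               .reshape(b * h * w, n_frames, c))
        q = self.norm(xt)
        xt = xt + self.attn(q, q, q, need_weights=False)[0]   # residual
        # Restore (batch * n_frames, channels, height, width).
        return (xt.reshape(b, h * w, n_frames, c)
                  .permute(0, 2, 3, 1)
                  .reshape(bt, c, h, w))

feats = torch.randn(2 * 8, 320, 32, 32)   # 2 clips of 8 frames, latent-sized
out = TemporalAttentionBlock()(feats, n_frames=8)
print(out.shape)                          # torch.Size([16, 320, 32, 32])
```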
Datasets and Benchmarks
- Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., ... & Zisserman, A. (2017). "The Kinetics Human Action Video Dataset." arXiv:1705.06950. The Kinetics dataset that transformed video understanding research. Multiple versions (400, 600, 700) provide increasingly comprehensive action recognition benchmarks.
- Grauman, K., Westbury, A., et al. (2022). "Ego4D: Around the World in 3,000 Hours of Egocentric Video." CVPR 2022. A massive egocentric video benchmark with diverse annotations for episodic memory, forecasting, and social interaction understanding.
- Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., ... & Memisevic, R. (2017). "The 'Something Something' Video Database for Learning and Evaluating Visual Common Sense." ICCV 2017. The Something-Something dataset requires genuine temporal understanding, where individual frames are insufficient for classification.
Surveys
- Selva, J., Johansen, A. S., Escalera, S., Nasrollahi, K., Moeslund, T. B., & Clapés, A. (2023). "Video Transformers: A Survey." IEEE TPAMI. A comprehensive survey of transformer architectures for video understanding, covering classification, detection, segmentation, and generation.
Looking Ahead
The video understanding foundations from this chapter connect to several related topics:
- Chapter 27 (Diffusion Models): Video generation extends image diffusion with temporal attention layers.
- Chapter 28 (Multimodal Models): Video-language models adapt the image-language architectures (LLaVA, BLIP-2) to temporal sequences.
- Chapter 36 (Reinforcement Learning): Video understanding is essential for embodied AI agents that must perceive and act in dynamic environments.