Chapter 26: Further Reading
Vision Transformer Foundations
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. The original ViT paper demonstrating that a standard transformer applied directly to image patches achieves competitive image classification results, especially when pre-trained on large datasets. A minimal sketch of the patch-embedding step appears after this list.
- Touvron, H., Cord, M., Douze, M., et al. (2021). "Training Data-Efficient Image Transformers & Distillation through Attention." ICML 2021. DeiT introduces training recipes (augmentation, regularization, distillation) that make ViTs competitive on ImageNet-1K without large-scale pre-training.
- Steiner, A., Kolesnikov, A., Zhai, X., et al. (2022). "How to Train Your ViT? Data, Augmentation, and Regularization in Vision Transformers." TMLR 2022. A systematic study of training strategies for ViTs, providing practical guidance on augmentation, regularization, and model selection.
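The patch embedding that the ViT title alludes to, cutting an image into 16x16 patches and linearly projecting each one into a token, can be sketched in a few lines of PyTorch. This is a minimal illustration rather than any paper's reference implementation; the class name and the ViT-Base-like defaults (224x224 input, 768-dimensional embeddings) are chosen only for concreteness.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT-style patch embedding: split an image into 16x16 patches
    and project each patch to an embedding vector (a 'visual word')."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting non-overlapping
        # patches and applying a shared linear projection to each of them.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting token sequence (plus a class token and position embeddings) is what the standard transformer encoder consumes.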
Hierarchical and Efficient Architectures
- Liu, Z., Lin, Y., Cao, Y., et al. (2021). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." ICCV 2021. Introduces shifted window attention and hierarchical feature maps, making vision transformers practical for dense prediction tasks. A window-partitioning sketch appears after this list.
- Liu, Z., Hu, H., Lin, Y., et al. (2022). "Swin Transformer V2: Scaling Up Capacity and Resolution." CVPR 2022. Extends Swin Transformer to 3 billion parameters with techniques for training stability at scale.
- Dong, X., Bao, J., Chen, D., et al. (2022). "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Window Self-Attention." CVPR 2022. Proposes cross-shaped window attention as an alternative to shifted windows for improved efficiency.
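The mechanics behind shifted window attention, tiling a feature map into fixed-size windows and cyclically shifting it between layers so neighboring windows exchange information, come down to a couple of tensor reshapes. The sketch below is simplified: it assumes the spatial dimensions divide evenly by the window size, omits the attention masking that shifted layers require, and uses illustrative Swin-T-like sizes.

```python
import torch

def window_partition(x, window_size=7):
    """Split a feature map (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size, window_size, C), ready for window attention.
    H and W are assumed to be divisible by window_size."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def shift_windows(x, shift=3):
    """Cyclically shift the feature map so the next attention layer sees a
    different window tiling (the 'shifted window' trick, minus the masking)."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

feat = torch.randn(2, 56, 56, 96)            # stage-1 Swin-T-like feature map
windows = window_partition(shift_windows(feat))
print(windows.shape)                          # torch.Size([128, 7, 7, 96])
```

Because attention is computed inside each 7x7 window, the cost grows linearly with image area instead of quadratically, which is what makes the hierarchical design practical for dense prediction.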
Object Detection with Transformers
- Carion, N., Massa, F., Synnaeve, G., et al. (2020). "End-to-End Object Detection with Transformers." ECCV 2020. The DETR paper that eliminates hand-designed components in object detection through set prediction with transformers. A simplified bipartite-matching sketch appears after this list.
- Zhu, X., Su, W., Lu, L., et al. (2021). "Deformable DETR: Deformable Transformers for End-to-End Object Detection." ICLR 2021. Addresses DETR's slow convergence and poor small-object performance through deformable attention mechanisms.
- Zhang, H., Li, F., Liu, S., et al. (2023). "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection." ICLR 2023. Advances DETR with contrastive denoising training and mixed query selection, achieving state-of-the-art detection results.
- Zhao, Y., Lv, W., Xu, S., et al. (2024). "DETRs Beat YOLOs on Real-time Object Detection." CVPR 2024. RT-DETR demonstrates that transformer-based detectors can match or exceed YOLO models in real-time detection scenarios.
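Set prediction in DETR hinges on bipartite matching: each ground-truth box is assigned to exactly one object query by minimizing a matching cost with the Hungarian algorithm, which is what removes the need for anchors and non-maximum suppression. The sketch below is a simplified version of that step, using only a classification term and an L1 box term (the full DETR cost also includes a generalized-IoU term); function name and shapes are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Simplified DETR-style bipartite matching for one image.
    pred_logits: (N, num_classes), pred_boxes: (N, 4), gt_labels: (M,),
    gt_boxes: (M, 4).  Returns matched (prediction index, ground-truth index) pairs."""
    prob = pred_logits.softmax(-1)                       # (N, num_classes)
    cost_class = -prob[:, gt_labels]                     # (N, M): prefer confident class scores
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M): L1 distance between boxes
    cost = cost_bbox + cost_class                        # DETR additionally adds a GIoU term
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx

# 100 object queries matched against 3 ground-truth boxes
pred_idx, gt_idx = match_predictions(torch.randn(100, 92), torch.rand(100, 4),
                                     torch.tensor([3, 17, 17]), torch.rand(3, 4))
print(list(zip(pred_idx, gt_idx)))
```

Queries left unmatched are trained to predict the "no object" class, so the model outputs a set rather than a ranked list of overlapping boxes.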
Segmentation with Vision Transformers
- Xie, E., Wang, W., Yu, Z., et al. (2021). "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." NeurIPS 2021. A hierarchical transformer with efficient self-attention and a lightweight MLP decoder for semantic segmentation. A decoder sketch appears after this list.
- Kirillov, A., Mintun, E., Ravi, N., et al. (2023). "Segment Anything." ICCV 2023. The SAM paper, introducing a foundation model for image segmentation that generalizes to unseen objects and domains.
- Ranftl, R., Bochkovskiy, A., & Koltun, V. (2021). "Vision Transformers for Dense Prediction." ICCV 2021. DPT uses plain ViT features with a convolutional decoder for monocular depth estimation and semantic segmentation.
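The "lightweight MLP decoder" idea behind SegFormer is to project each backbone stage to a common width, upsample everything to the highest-resolution stage, concatenate, and fuse before the classification head. The sketch below follows that description but is not the paper's exact module: 1x1 convolutions stand in for the per-token linear layers, and the channel counts and class count are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecoder(nn.Module):
    """Sketch of a SegFormer-style all-MLP decoder: per-stage projections,
    upsampling to a common resolution, concatenation, and a linear fusion."""
    def __init__(self, in_dims=(32, 64, 160, 256), embed_dim=256, num_classes=19):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, 1)
        self.head = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):                 # feats: list of (B, C_i, H_i, W_i)
        target = feats[0].shape[2:]           # upsample everything to the 1/4-scale map
        ups = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))

# Multi-scale features at 1/4, 1/8, 1/16, 1/32 of a 512x512 input
feats = [torch.randn(1, c, s, s) for c, s in zip((32, 64, 160, 256), (128, 64, 32, 16))]
print(MLPDecoder()(feats).shape)              # torch.Size([1, 19, 128, 128])
```

The appeal is that all the heavy lifting happens in the hierarchical encoder; the decoder itself adds very little compute.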
Self-Supervised Vision Transformers
- He, K., Chen, X., Xie, S., et al. (2022). "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022. MAE applies masked prediction to images with a 75% masking ratio, learning powerful representations through pixel reconstruction. A random-masking sketch appears after this list.
- Caron, M., Touvron, H., Misra, I., et al. (2021). "Emerging Properties in Self-Supervised Vision Transformers." ICCV 2021. DINO reveals that self-supervised ViTs learn features with emergent object segmentation properties and excellent k-NN classification performance.
- Oquab, M., Darcet, T., Moutakanni, T., et al. (2024). "DINOv2: Learning Robust Visual Features without Supervision." TMLR 2024. Scales self-supervised vision transformer training with curated data, producing features that rival supervised pre-training across diverse tasks.
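What makes MAE cheap is that the encoder only ever sees the 25% of patch tokens that survive random masking; the masked positions are reconstructed later by a small decoder. The masking itself reduces to a per-image random permutation and a gather, sketched below with ViT-Base-like shapes (a sketch, not the authors' code).

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """MAE-style random masking: keep a random 25% of patch tokens per image.
    tokens: (B, N, D).  Returns the visible tokens and the indices needed to
    restore the original patch order after decoding."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                         # one random score per token
    ids_shuffle = noise.argsort(dim=1)               # random permutation per image
    ids_restore = ids_shuffle.argsort(dim=1)         # inverse permutation
    ids_keep = ids_shuffle[:, :num_keep]             # first 25% of the permutation
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore

visible, ids_restore = random_masking(torch.randn(2, 196, 768))
print(visible.shape)   # torch.Size([2, 49, 768]) -- only 25% of tokens are encoded
```

Dropping three quarters of the tokens before the encoder is what lets MAE pre-train large ViTs with modest compute.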
Hybrid and Modern Architectures
- Liu, Z., Mao, H., Wu, C.-Y., et al. (2022). "A ConvNet for the 2020s." CVPR 2022. ConvNeXt modernizes ResNet with ViT-inspired design choices, demonstrating that pure CNNs remain competitive with vision transformers. A block sketch appears after this list.
- Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). "CoAtNet: Marrying Convolution and Attention for All Data Sizes." NeurIPS 2021. Systematically combines depthwise convolutions and self-attention, achieving strong performance across dataset sizes.
- Tu, Z., Talebi, H., Zhang, H., et al. (2022). "MaxViT: Multi-Axis Vision Transformer." ECCV 2022. Proposes multi-axis attention that combines local (window) and global (grid) attention patterns within each block.
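The flavour of ConvNeXt's "ViT-inspired design choices" is easiest to see in its basic block: a large-kernel depthwise convolution followed by a transformer-style inverted MLP (LayerNorm, 4x expansion, GELU, projection) with a residual connection. The sketch below omits layer scale and stochastic depth and uses illustrative sizes; it is an approximation of the block structure, not the reference implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt block: 7x7 depthwise convolution plus a
    transformer-style MLP, applied channels-last, with a residual connection."""
    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                 # normalizes over channels
        self.pwconv1 = nn.Linear(dim, 4 * dim)        # 1x1 conv written as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                             # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)        # to channels-last (B, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)       # back to (B, C, H, W)

print(ConvNeXtBlock()(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```

The large depthwise kernel plays the role of self-attention's spatial mixing, while the pointwise layers mirror a transformer's MLP.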
Efficiency and Scaling
- Dao, T., Fu, D. Y., Ermon, S., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. An IO-aware attention implementation that dramatically improves speed and memory efficiency for vision transformer training. A usage sketch appears after this list.
- Dehghani, M., Mustafa, B., Djolonga, J., et al. (2023). "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution." NeurIPS 2023. NaViT eliminates the need for fixed-resolution inputs by packing patches from variable-resolution images into the same batch.
- Bolya, D., Fu, C.-Y., Dai, X., et al. (2023). "Token Merging: Your ViT But Faster." ICLR 2023. A simple token reduction method that merges similar tokens to accelerate ViT inference without requiring retraining.
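Benefiting from FlashAttention-style kernels in PyTorch usually does not require custom code: torch.nn.functional.scaled_dot_product_attention dispatches to a fused, memory-efficient backend when the device, dtype, and shapes allow it (which backend is chosen depends on the hardware and PyTorch version). A minimal sketch with ViT-L/16-like shapes:

```python
import torch
import torch.nn.functional as F

# ViT-L-like attention: batch 8, 16 heads of dim 64, 577 tokens (576 patches + CLS).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(8, 16, 577, 64, device=device, dtype=dtype)
k, v = torch.randn_like(q), torch.randn_like(q)

# Fused attention: a FlashAttention-style kernel, when selected, avoids
# materializing the full 577x577 attention matrix in global memory.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)   # torch.Size([8, 16, 577, 64])
```

Because attention dominates ViT compute at high resolution, this single call is often the largest practical speedup available without changing the model.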