Chapter 26: Exercises — Vision Transformers and Modern Computer Vision
Conceptual Exercises
Exercise 1: Patch Embedding Dimensions
An image of size $384 \times 384 \times 3$ is processed by a ViT with patch size $P = 32$. Calculate: (a) the number of patches, (b) the flattened patch dimension, and (c) the total sequence length including the [CLS] token.
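As a reference for parts (a) through (c), the general relations for an $H \times W \times C$ image with patch size $P$ are

$$
N = \frac{H}{P} \cdot \frac{W}{P}, \qquad d_{\text{patch}} = P^2 \cdot C, \qquad L = N + 1 \;\; (\text{including the [CLS] token}).
$$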
Exercise 2: Computational Cost Comparison
Compare the FLOPs of self-attention for ViT-B/16 on a $224 \times 224$ image versus a $384 \times 384$ image. By what factor does the attention cost increase?
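One common accounting for a single self-attention layer with sequence length $N$ and embedding dimension $D$ (counting each multiply-accumulate once) includes the Q/K/V and output projections plus the two $N \times N$ matrix products:

$$
\text{FLOPs} \approx 4ND^2 + 2N^2D.
$$

Only the $2N^2D$ term grows quadratically with the number of patches, so work out $N$ at each resolution first.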
Exercise 3: Inductive Bias Analysis
Explain why ViT performs worse than CNNs on small datasets but better on large datasets. Relate your answer to the concept of inductive biases and the bias-variance tradeoff.
Exercise 4: Position Embedding Interpolation
A ViT-B/16 is pre-trained on $224 \times 224$ images (196 patches + 1 CLS token). You want to fine-tune on $384 \times 384$ images. (a) How many patches will the new resolution produce? (b) Describe how the position embeddings should be adapted. (c) Why does reshaping the patch position embeddings to a 2D grid and interpolating bicubically work better than interpolating the 1D sequence directly?
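A minimal sketch of part (b), assuming the position embedding tensor has shape $(1, 1 + N, D)$ with the [CLS] embedding stored first (the convention in timm and torchvision checkpoints):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_grid_size):
    """Resize ViT position embeddings to a new patch grid.

    pos_embed: (1, 1 + N, D) tensor with the [CLS] embedding first.
    new_grid_size: target grid side length, e.g. 24 for 384 / 16.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)            # e.g. 14 for 224 / 16
    dim = patch_pos.shape[-1]
    # (1, N, D) -> (1, D, old_grid, old_grid) so we can interpolate spatially
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid_size, new_grid_size),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid_size ** 2, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)
```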
Exercise 5: Swin Window Attention Complexity
For a Swin Transformer processing a $56 \times 56$ feature map with window size $M = 7$, calculate: (a) the number of windows, (b) the self-attention FLOPs per window, and (c) the total attention FLOPs, expressing your answers in terms of the channel dimension $C$. Compare the result with global self-attention over the full feature map.
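The complexity accounting from the Swin Transformer paper is a useful reference: for an $h \times w$ feature map with channel dimension $C$ and window size $M$,

$$
\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C, \qquad \Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC.
$$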
Exercise 6: DETR Object Queries
Explain the role of object queries in DETR. Why does DETR use a fixed number of queries (e.g., 100) and what happens if the image contains more objects than queries?
Exercise 7: Shifted Window Mechanism
Draw a diagram showing how the shifted window mechanism in Swin Transformer creates cross-window connections. Explain why cyclic shifting is more efficient than padding.
Exercise 8: DeiT Distillation
Compare soft distillation and hard-label distillation in DeiT. Why might hard-label distillation outperform soft distillation despite losing probability distribution information?
Exercise 9: MAE Masking Ratio
Explain why Masked Autoencoders (MAE) use a 75% masking ratio for images while BERT uses only 15% for text. What property of images makes higher masking possible?
Exercise 10: ViT Attention Patterns
Describe how attention patterns in ViT evolve across layers (early, middle, late). How does this compare to the feature hierarchy in CNNs?
Implementation Exercises
Exercise 11: Patch Embedding Module
Implement a PatchEmbedding module in PyTorch that takes a batch of images and produces patch embeddings with position embeddings and a [CLS] token.
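A starting-point sketch, assuming ViT-Base defaults (patch size 16, embedding dimension 768); the module and argument names are illustrative. The strided convolution is equivalent to flattening each patch and applying a shared linear projection:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, embed them, prepend [CLS], add positions."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Strided conv = per-patch flatten + linear projection
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                # x: (B, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1)                   # (B, N + 1, D)
        return x + self.pos_embed
```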
Exercise 12: Multi-Head Self-Attention
Implement the multi-head self-attention mechanism for ViT. Follow the pre-norm convention of the original ViT paper, in which Layer Normalization is applied before attention rather than after.
Exercise 13: ViT Block
Implement a complete ViT transformer block including Layer Normalization, Multi-Head Self-Attention, and the Feed-Forward Network with GELU activation. Include residual connections.
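A pre-norm block sketch; it uses `nn.MultiheadAttention` as a stand-in for the attention module from Exercise 12, so swap in your own implementation:

```python
import torch.nn as nn

class ViTBlock(nn.Module):
    """Pre-norm encoder block: x + MSA(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                                   # x: (B, N, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual 1
        x = x + self.mlp(self.norm2(x))                     # residual 2
        return x
```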
Exercise 14: Complete ViT Model
Combine the patch embedding, transformer blocks, and classification head into a complete ViT model. Verify the number of parameters matches ViT-Base specifications.
Exercise 15: Window Attention
Implement the window-based self-attention mechanism from Swin Transformer. Your implementation should partition the feature map into non-overlapping windows and compute attention within each window.
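A helper you will likely need is window partitioning; a sketch assuming an NHWC layout with $H$ and $W$ divisible by $M$:

```python
def window_partition(x, M):
    """(B, H, W, C) -> (B * num_windows, M * M, C): non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()     # (B, H/M, W/M, M, M, C)
    return x.view(-1, M * M, C)                      # tokens grouped per window

def window_reverse(windows, M, H, W):
    """Inverse of window_partition: (B * num_windows, M * M, C) -> (B, H, W, C)."""
    B = windows.shape[0] // ((H // M) * (W // M))
    x = windows.view(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
```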
Exercise 16: Shifted Window Attention
Extend your window attention implementation to include the shifted window mechanism with cyclic shifting and attention masking.
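The cyclic shift itself is a one-liner with `torch.roll`; the harder part of the exercise is building the attention mask that stops tokens from attending across the wrap-around seams. A sketch of the shift, assuming an NHWC feature map:

```python
import torch

def cyclic_shift(x, shift):
    """Shift a (B, H, W, C) map by `shift` pixels with wrap-around, so the
    shifted map reuses the same non-overlapping window partition."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

def reverse_cyclic_shift(x, shift):
    """Undo the cyclic shift after window attention."""
    return torch.roll(x, shifts=(shift, shift), dims=(1, 2))
```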
Exercise 17: Patch Merging
Implement the patch merging layer used in Swin Transformer, which downsamples the spatial resolution by a factor of 2 while doubling the channel dimension.
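A sketch assuming an NHWC input with even height and width; each $2 \times 2$ neighborhood is concatenated to $4C$ channels and projected to $2C$:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 neighborhood (4C channels) and project to 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                            # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                     # top-left of each 2x2 block
        x1 = x[:, 1::2, 0::2, :]                     # bottom-left
        x2 = x[:, 0::2, 1::2, :]                     # top-right
        x3 = x[:, 1::2, 1::2, :]                     # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)      # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))          # (B, H/2, W/2, 2C)
```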
Exercise 18: Relative Position Bias
Implement the relative position bias table used in Swin Transformer's attention mechanism.
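A sketch of the learnable bias table and the relative-position index that selects from it; the returned tensor is added to the attention logits (broadcast over the batch) before the softmax:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Swin-style bias b[i, j] indexed by the relative offset between
    positions i and j inside an M x M window."""
    def __init__(self, window_size, num_heads):
        super().__init__()
        M = window_size
        self.table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(M), torch.arange(M), indexing="ij"))   # (2, M, M)
        coords = coords.flatten(1)                              # (2, M*M)
        rel = coords[:, :, None] - coords[:, None, :]           # (2, M*M, M*M)
        rel = rel.permute(1, 2, 0).contiguous()                 # (M*M, M*M, 2)
        rel[:, :, 0] += M - 1                                   # shift offsets to >= 0
        rel[:, :, 1] += M - 1
        rel[:, :, 0] *= 2 * M - 1                               # row-major flattening
        self.register_buffer("index", rel.sum(-1))              # (M*M, M*M)

    def forward(self):
        M2 = self.index.shape[0]
        bias = self.table[self.index.view(-1)].view(M2, M2, -1)
        return bias.permute(2, 0, 1)     # (num_heads, M*M, M*M), add to logits
```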
Exercise 19: Bipartite Matching
Implement the Hungarian matching algorithm used in DETR's loss computation. Use scipy.optimize.linear_sum_assignment for the actual matching.
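A single-image sketch of the matching step; the cost weights and the omission of the generalized-IoU term are simplifications relative to the full DETR criterion:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                    cost_class=1.0, cost_bbox=5.0):
    """Match predictions to ground truth for one image.

    pred_logits: (Q, num_classes), pred_boxes: (Q, 4),
    tgt_labels: (T,), tgt_boxes: (T, 4). Returns (pred_idx, tgt_idx).
    Full DETR also adds a generalized-IoU term to the cost matrix.
    """
    prob = pred_logits.softmax(-1)                        # (Q, num_classes)
    cost_cls = -prob[:, tgt_labels]                       # (Q, T)
    cost_l1 = torch.cdist(pred_boxes, tgt_boxes, p=1)     # (Q, T)
    cost = cost_class * cost_cls + cost_bbox * cost_l1
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(tgt_idx)
```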
Exercise 20: Attention Map Visualization
Write code to extract and visualize attention maps from a pre-trained ViT model for a given input image. Display attention from different heads and layers.
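One hedged approach, assuming a timm-style ViT whose blocks expose an `attn.qkv` projection (attribute names differ in other implementations): hook the qkv output and recompute the softmaxed attention weights yourself, since many attention modules no longer return them.

```python
import torch

def get_attention_maps(model, image, layer_idx=-1):
    """Recompute attention weights for one block of a timm-style ViT.

    Assumes blocks expose `attn.qkv` and `attn.num_heads` (true for timm's
    vision_transformer); adapt the attribute names for other models.
    """
    feats = {}
    block = model.blocks[layer_idx]

    def hook(module, inp, out):
        feats["qkv"] = out                                # (B, N, 3 * D)

    handle = block.attn.qkv.register_forward_hook(hook)
    with torch.no_grad():
        model(image)
    handle.remove()

    B, N, three_d = feats["qkv"].shape
    h = block.attn.num_heads
    d = three_d // 3 // h
    qkv = feats["qkv"].reshape(B, N, 3, h, d).permute(2, 0, 3, 1, 4)
    q, k = qkv[0], qkv[1]                                 # (B, h, N, d)
    attn = (q @ k.transpose(-2, -1)) * d ** -0.5
    return attn.softmax(dim=-1)                           # (B, h, N, N)
```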
Applied Exercises
Exercise 21: Fine-Tuning Comparison
Fine-tune both ViT-Base and ResNet-50 (from torchvision) on CIFAR-100 with identical training recipes. Compare: (a) final accuracy, (b) convergence speed, (c) training FLOPs, and (d) inference latency.
Exercise 22: Data Augmentation Ablation
Train DeiT-Small on CIFAR-100 with progressively removed augmentations (Mixup, CutMix, RandAugment, random erasing). Report accuracy for each configuration and discuss which augmentations are most critical for ViTs.
Exercise 23: Resolution Scaling
Fine-tune a pre-trained ViT-B/16 at resolutions 224, 384, and 512. Plot accuracy vs. resolution and FLOPs vs. resolution. At what resolution does the accuracy improvement no longer justify the computational cost?
Exercise 24: Layer-Wise Learning Rate Decay
Implement layer-wise learning rate decay for ViT fine-tuning. Compare training with uniform learning rate vs. decay factors of 0.65, 0.75, and 0.85 on a downstream task.
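A sketch of building the parameter groups, assuming a timm-style ViT where transformer blocks are named `blocks.{i}`, the embeddings sit at the bottom of the decay schedule, and the head keeps the full base learning rate:

```python
def layerwise_lr_groups(model, base_lr=1e-3, decay=0.75, num_layers=12):
    """Parameter groups with lr = base_lr * decay ** (depth from the top).

    Embeddings get the smallest lr, `blocks.{i}` scale with depth, and the
    classification head (and final norm) keep the full base_lr.
    """
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith(("patch_embed", "pos_embed", "cls_token")):
            layer_id = 0
        elif name.startswith("blocks."):
            layer_id = int(name.split(".")[1]) + 1
        else:                                   # head, final norm
            layer_id = num_layers + 1
        lr = base_lr * decay ** (num_layers + 1 - layer_id)
        groups.append({"params": [param], "lr": lr})
    return groups

# optimizer = torch.optim.AdamW(layerwise_lr_groups(model), weight_decay=0.05)
```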
Exercise 25: Knowledge Distillation
Train a ViT-Small student model using a pre-trained ViT-Large as the teacher. Implement both soft and hard distillation, and compare results with training the student from scratch.
Exercise 26: Object Detection Pipeline
Build a complete object detection pipeline using DETR from HuggingFace. Evaluate on a custom dataset of at least 5 classes, reporting mAP at IoU thresholds of 0.5 and 0.75.
Exercise 27: Semantic Segmentation
Fine-tune a SegFormer model on a small segmentation dataset (e.g., Oxford-IIIT Pet). Compare results with a CNN-based segmentation model (e.g., DeepLabV3).
Exercise 28: Self-Supervised Pre-Training
Implement a simplified version of DINO self-supervised training for ViT-Tiny on CIFAR-10. Evaluate the quality of learned features using k-NN classification.
Exercise 29: Hybrid Architecture
Design and implement a hybrid CNN-Transformer architecture that uses convolutional layers for the first two stages and self-attention for the remaining stages. Compare with pure CNN and pure ViT baselines on CIFAR-100.
Exercise 30: Swin Transformer for Detection
Use a Swin Transformer backbone with a Feature Pyramid Network for object detection. Compare against a ResNet-50 + FPN baseline on the same dataset.
Challenge Exercises
Exercise 31: Efficient ViT
Implement three different efficiency improvements for ViT: (a) token pruning (removing unimportant tokens after early layers), (b) token merging (combining similar tokens), and (c) linear attention approximation. Compare accuracy vs. throughput tradeoffs.
Exercise 32: MAE Implementation
Implement a Masked Autoencoder from scratch. Train it on a subset of ImageNet and evaluate the quality of learned representations through linear probing and fine-tuning.
Exercise 33: Cross-Attention for Detection
Implement DETR's cross-attention mechanism where object queries attend to image features. Visualize which image regions each query attends to.
Exercise 34: Multi-Scale ViT
Design a vision transformer that processes images at multiple resolutions simultaneously, using cross-scale attention to exchange information between resolution levels. Evaluate on a dense prediction task.
Exercise 35: ViT Robustness Study
Systematically evaluate a pre-trained ViT and a pre-trained ResNet on: (a) ImageNet-C (corrupted images), (b) ImageNet-R (renditions), (c) patch-level occlusion, and (d) adversarial perturbations (FGSM). Analyze and discuss the robustness differences.
Exercise 36: Positional Encoding Alternatives
Implement and compare four types of position encoding for ViT: (a) learned absolute, (b) sinusoidal 2D, (c) relative position bias (Swin-style), and (d) rotary position embeddings (RoPE adapted for 2D). Evaluate on ImageNet classification.
Exercise 37: ViT Pruning
Implement structured pruning for a ViT model by removing attention heads and FFN neurons based on importance scores. Achieve at least 50% FLOPs reduction with less than 2% accuracy drop.
Exercise 38: Flash Attention Integration
Integrate FlashAttention into a ViT implementation. Benchmark memory usage and throughput at resolutions from 224 to 1024 compared to standard attention.
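If you prefer to avoid a separate dependency, PyTorch 2's `torch.nn.functional.scaled_dot_product_attention` can dispatch to FlashAttention-style fused kernels on supported GPUs; a minimal drop-in for the attention core might look like this (shapes follow the usual `(B, heads, N, head_dim)` convention, and the benchmark shapes below are hypothetical):

```python
import torch
import torch.nn.functional as F

def attention_core(q, k, v):
    """Fused attention; may use a FlashAttention kernel when the backend
    supports it (GPU, fp16/bf16, no attention weights requested)."""
    return F.scaled_dot_product_attention(q, k, v)

# Hypothetical ViT-B shapes at 1024x1024 with 16x16 patches:
# N = (1024 // 16) ** 2 + 1 = 4097 tokens, 12 heads, head_dim 64.
q = torch.randn(1, 12, 4097, 64)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attention_core(q, k, v)          # (1, 12, 4097, 64)
```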
Exercise 39: Dynamic Token Sampling
Implement an adaptive token selection mechanism that dynamically chooses which image patches to process based on an initial coarse pass. The goal is to reduce computation for "easy" images while retaining the full set of tokens for "hard" images.
Exercise 40: End-to-End Vision System
Build a complete vision system that takes an input image and performs: (a) classification with ViT, (b) object detection with DETR, and (c) semantic segmentation with SegFormer. Display all results in a unified visualization.