Chapter 26: Quiz — Vision Transformers and Modern Computer Vision

Test your understanding of vision transformers with these questions. Try to answer each question yourself before reading the answer.


Question 1

In the original ViT architecture, how is an image converted into a sequence suitable for the transformer?

Answer: The image is divided into fixed-size non-overlapping patches (e.g., 16x16 pixels). Each patch is flattened into a vector and linearly projected to the model's embedding dimension. A learnable [CLS] token is prepended, and learnable position embeddings are added to each token. For a 224x224 image with 16x16 patches, this produces 196 patch tokens + 1 [CLS] token = 197 tokens.
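
For concreteness, a minimal PyTorch sketch of this patchify-and-embed step (class and variable names are illustrative, not taken from any particular library):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed them (illustrative sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196
        # A stride-p convolution is equivalent to "flatten each patch + linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                          # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)                # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)            # (B, 1, 768)
        x = torch.cat([cls, x], dim=1)                             # (B, 197, 768)
        return x + self.pos_embed                                  # add learnable position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```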

Question 2

What is the purpose of the [CLS] token in ViT, and why is it needed?

Answer: The [CLS] token serves as a global image representation. It does not correspond to any specific image patch but aggregates information from all patches through self-attention across transformer layers. After the final layer, the [CLS] token's output is passed to the classification head. It is needed because ViT requires a fixed-size representation for classification, and the [CLS] token provides this by acting as a learnable query that collects information from the entire image.

Question 3

Why does the original ViT require large-scale pre-training datasets (like JFT-300M) to match CNN performance?

Answer: ViT lacks the inductive biases built into CNNs — namely translation equivariance (from weight sharing) and locality (from small convolutional kernels). These biases act as strong priors that help CNNs learn efficiently from limited data. Without these priors, ViT must learn spatial relationships entirely from data, requiring much larger datasets. With sufficient data, this flexibility becomes an advantage because the model is not constrained by potentially suboptimal architectural assumptions.

Question 4

What are the three key training strategies that DeiT uses to train ViT competitively on ImageNet-1K alone?

Answer:
1. **Strong data augmentation**: RandAugment, Mixup, CutMix, and random erasing to compensate for the lack of convolutional inductive biases.
2. **Heavy regularization**: Stochastic depth (randomly dropping entire transformer layers), label smoothing, and repeated augmentation to prevent overfitting.
3. **Knowledge distillation**: Using a CNN teacher (RegNet) with a special distillation token to transfer inductive biases. Hard-label distillation (using the teacher's argmax prediction) proved more effective than soft distillation (sketched below).
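
A hedged sketch of the hard-label distillation term, assuming the [CLS]-head, distillation-head, and teacher logits have already been computed (function and argument names are placeholders, not DeiT's actual API):

```python
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Hard-label distillation as described above (sketch).
    cls_logits / dist_logits: outputs of the [CLS] and distillation heads, (B, num_classes).
    teacher_logits: CNN teacher outputs, (B, num_classes). labels: ground truth, (B,)."""
    loss_cls = F.cross_entropy(cls_logits, labels)          # supervise the [CLS] head with true labels
    teacher_hard = teacher_logits.argmax(dim=-1)            # the teacher's argmax acts as a "hard" label
    loss_dist = F.cross_entropy(dist_logits, teacher_hard)  # supervise the distillation head
    return 0.5 * loss_cls + 0.5 * loss_dist
```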

Question 5

Explain the shifted window mechanism in Swin Transformer. Why is it needed?

Answer: Swin Transformer computes self-attention within non-overlapping local windows (e.g., 7x7 patches) for efficiency. However, this prevents information exchange between windows. The shifted window mechanism alternates between regular window partitioning and a shifted partitioning (offset by half the window size) in consecutive transformer layers. This creates cross-window connections, allowing information to flow across window boundaries. The shift is implemented efficiently using cyclic shifting of the feature map combined with attention masking, avoiding the need for padding.
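
A minimal illustration of the cyclic shift itself using `torch.roll`; the window partitioning and attention masking that accompany it are omitted, and the shapes are illustrative:

```python
import torch

def cyclic_shift(x, shift):
    """Cyclically shift a (B, H, W, C) feature map, as used between window partitions (sketch)."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

x = torch.randn(1, 56, 56, 96)           # Swin stage-1 feature map: 56x56 tokens, 96 channels
window = 7
shifted = cyclic_shift(x, window // 2)   # shift by half the window size (3) before re-partitioning
restored = torch.roll(shifted, shifts=(window // 2, window // 2), dims=(1, 2))
assert torch.equal(restored, x)          # the reverse shift after attention recovers the original layout
```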

Question 6

How does Swin Transformer's computational complexity compare to standard ViT for an image with $N$ patches?

Answer: Standard ViT has $O(N^2)$ complexity for self-attention because every patch attends to every other patch. Swin Transformer has $O(N \cdot M^2)$ complexity, where $M$ is the window size (typically 7), because attention is computed only within local windows of fixed size $M \times M$. This makes Swin Transformer linear in the number of patches for a fixed window size, enabling application to much higher resolution images.
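
A quick back-of-the-envelope check of this difference, counting query-key pairs scored per attention layer (the resolution and window size below are illustrative):

```python
# Query-key pairs scored per attention layer (ignoring heads and the feature dimension d).
N = 56 * 56     # tokens for a 224x224 image at stride 4 (Swin stage-1 resolution, illustrative)
M = 7           # Swin window size

global_pairs = N ** 2          # ViT-style global attention: every token attends to every token
window_pairs = N * M ** 2      # Swin: each token attends only within its own 7x7 window

print(global_pairs, window_pairs, global_pairs // window_pairs)
# 9834496 153664 64  -> window attention scores 64x fewer pairs at this resolution
```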

Question 7

What is the patch merging operation in Swin Transformer, and what purpose does it serve?

Answer: Patch merging reduces the spatial resolution by 2x while increasing the channel dimension by 2x, creating a hierarchical feature representation. It works by grouping patches in 2x2 spatial neighborhoods, concatenating their features (resulting in 4x the channel dimension), and then applying a linear projection to reduce the channels to 2x the original dimension. This creates multi-scale feature maps similar to CNNs, which is essential for dense prediction tasks like object detection and segmentation.
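
A minimal PyTorch sketch of this operation (shapes and names are illustrative):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge 2x2 neighborhoods: halve H and W, double the channel dim (sketch of the idea)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)   # 4C -> 2C

    def forward(self, x):                        # x: (B, H, W, C) with even H, W
        x0 = x[:, 0::2, 0::2, :]                 # the four members of each 2x2 neighborhood
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

out = PatchMerging(96)(torch.randn(1, 56, 56, 96))
print(out.shape)  # torch.Size([1, 28, 28, 192])
```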

Question 8

In DETR, what is the Hungarian matching algorithm used for, and why is it necessary?

Answer: The Hungarian matching algorithm finds the optimal one-to-one assignment between DETR's predicted objects and ground truth objects during training. It is necessary because DETR predicts a fixed-size set of detections (e.g., 100) in parallel, and the loss must match each prediction to the appropriate ground truth object (or "no object"). Without this matching, the model would have no way to determine which prediction should be responsible for detecting which ground truth object. The matching minimizes a cost that combines classification confidence and bounding box similarity.
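
A toy illustration of the matching step using SciPy's `linear_sum_assignment`; the cost values below are made up, whereas DETR's actual cost combines class probability with L1 and generalized IoU box terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = 4 predictions, columns = 2 ground-truth objects (values are made up).
cost = np.array([
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.3],
    [0.5, 0.6],
])
pred_idx, gt_idx = linear_sum_assignment(cost)        # optimal one-to-one assignment
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))  # [(0, 1), (1, 0)]: pred 0 -> GT 1, pred 1 -> GT 0
# Unmatched predictions (2 and 3) are trained to predict the "no object" class.
```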

Question 9

What are the main advantages of DETR over traditional object detection pipelines like Faster R-CNN?

Answer: DETR eliminates several hand-designed components: (1) anchor boxes — no need to define anchor sizes and aspect ratios, (2) Non-Maximum Suppression (NMS) — no post-processing is needed to remove duplicate detections, since the one-to-one matching loss trains each object query to predict at most one object (or "no object"), (3) region proposal networks — the transformer decoder directly predicts all objects. This results in a simpler, more elegant architecture with fewer hyperparameters. DETR also excels at detecting large objects due to its global self-attention mechanism.

Question 10

How does SegFormer achieve efficient self-attention for semantic segmentation?

Answer: SegFormer uses two key techniques: (1) **Spatial reduction attention** — it reduces the spatial resolution of the keys and values by a factor $R$ before computing attention, changing the complexity from $O(N^2)$ to $O(N^2/R)$. (2) **Hierarchical architecture** — it produces multi-scale features at 1/4, 1/8, 1/16, and 1/32 resolution, so attention in the deeper stages operates on far fewer tokens. Additionally, it uses a simple MLP decoder (instead of a heavy transformer decoder), keeping the decoding stage efficient.
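
A single-head sketch of the spatial-reduction idea, where a strided convolution shrinks the key/value token count (playing the role of the reduction factor $R$ above); names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Attention whose keys/values come from a spatially downsampled map (single-head sketch)."""
    def __init__(self, dim, sr_ratio=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # A strided conv reduces the K/V token count by sr_ratio**2.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.scale = dim ** -0.5

    def forward(self, x, H, W):                                 # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x)                                           # (B, N, C)
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)      # (B, N / sr_ratio**2, C)
        k, v = self.kv(x_).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale           # (B, N, N / sr_ratio**2)
        return attn.softmax(dim=-1) @ v                         # (B, N, C)

out = SpatialReductionAttention(64, sr_ratio=4)(torch.randn(1, 56 * 56, 64), 56, 56)
print(out.shape)  # torch.Size([1, 3136, 64])
```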

Question 11

Why does ViT use pre-norm (Layer Norm before attention/FFN) instead of post-norm (Layer Norm after attention/FFN)?

Answer: Pre-norm places Layer Normalization before each sub-layer (attention or FFN) rather than after. This configuration has been shown to improve training stability, especially for deeper models, because it ensures that the residual connection carries the un-normalized signal, keeping the gradient magnitude more consistent across layers. Post-norm can lead to training instability and requires careful learning rate warmup. Pre-norm also allows the model to effectively skip layers via the residual path when a layer's contribution is small.
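
A minimal pre-norm block sketch using PyTorch's built-in multi-head attention (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm transformer block: LayerNorm is applied *before* attention and the FFN (sketch)."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual carries the un-normalized signal
        x = x + self.mlp(self.norm2(x))
        return x

print(PreNormBlock()(torch.randn(2, 197, 768)).shape)  # torch.Size([2, 197, 768])
```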

Question 12

What is the relative position bias in Swin Transformer, and how does it differ from ViT's position embeddings?

Answer: Swin Transformer uses a **relative position bias** — a learnable bias term added to the attention logits based on the relative spatial distance between tokens, parameterized by a table of size $(2M-1) \times (2M-1)$. This differs from ViT's **absolute position embeddings**, which are learnable vectors added to each patch embedding based on its absolute position. Relative position bias has two advantages: (1) it generalizes better to different input sizes since it only depends on relative distances, and (2) it provides approximately 1-2% accuracy improvement on ImageNet.
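
A sketch of how the $(2M-1) \times (2M-1)$ table is built and indexed for an $M \times M$ window (variable names are illustrative):

```python
import torch
import torch.nn as nn

M, num_heads = 7, 3                                   # window size and head count (illustrative)
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))   # one (2M-1)^2 table per head

# Pre-compute, for every pair of positions in an MxM window, which table entry to use.
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
coords = coords.flatten(1)                            # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]         # (2, M*M, M*M): relative (dy, dx) per pair
rel = rel.permute(1, 2, 0) + (M - 1)                  # shift each offset into [0, 2M-2]
index = rel[..., 0] * (2 * M - 1) + rel[..., 1]       # (M*M, M*M) flat index into the table

bias = bias_table[index.reshape(-1)].reshape(M * M, M * M, num_heads).permute(2, 0, 1)
print(bias.shape)  # torch.Size([3, 49, 49]) -- added to each head's attention logits
```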

Question 13

Explain the concept of attention distance in ViT and how it relates to CNN receptive fields.

Answer: Attention distance measures the average spatial distance between a query token and the tokens it attends to (weighted by attention weights). In early ViT layers, the mean attention distance is small (indicating local attention similar to small convolutional kernels), while in later layers, it increases dramatically (indicating global attention). This is analogous to how CNNs build up receptive fields gradually through stacking layers, but ViTs have the flexibility to attend globally at any layer. The key difference is that ViTs can dynamically adjust their attention distance based on the input, while CNN receptive fields are fixed by architecture.

Question 14

How does Masked Autoencoder (MAE) pre-training differ from BERT's masked token pre-training, and why?

Answer: MAE masks 75% of image patches (vs. BERT's 15% of text tokens) because images have much higher spatial redundancy than text — neighboring patches contain highly correlated information, so reconstructing a missing patch from its neighbors is too easy without a high masking ratio. Additionally, MAE's encoder only processes the unmasked patches (improving efficiency by 3-4x), while BERT processes all tokens including masked ones. MAE reconstructs raw pixel values, while BERT predicts discrete token IDs. This design creates a sufficiently challenging pre-training task that forces the model to learn semantic representations.
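
A minimal sketch of MAE-style random masking at a 75% ratio (shapes and names are illustrative):

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Keep a random 25% of patch tokens, as in MAE-style pre-training (sketch).
    x: (B, N, D) patch embeddings. Returns the kept tokens and the shuffle used to unshuffle later."""
    B, N, D = x.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                          # one random score per patch
    shuffle = noise.argsort(dim=1)                    # a random permutation of patch indices
    keep = shuffle[:, :num_keep]                      # the first num_keep indices are kept
    x_kept = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return x_kept, shuffle                            # the encoder only ever sees x_kept

x_kept, _ = random_masking(torch.randn(2, 196, 768))
print(x_kept.shape)  # torch.Size([2, 49, 768]) -- only 25% of the tokens enter the encoder
```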

Question 15

What is the purpose of the distillation token in DeiT, and how does it work?

Answer: The distillation token is a learnable embedding (similar to the [CLS] token) added to the input sequence. While the [CLS] token is trained with the ground truth label using cross-entropy loss, the distillation token is trained to match the predictions of a CNN teacher model. During inference, predictions from both tokens can be averaged (ensembled) for improved performance. The distillation token learns complementary representations — its predictions agree with the [CLS] token only about 60% of the time, yet combining them improves accuracy, suggesting the token captures different aspects of the input.

Question 16

Why do ViTs tend to be more robust to distribution shift than CNNs?

Answer: ViTs develop more shape-based representations compared to CNNs, which tend to rely heavily on texture cues. This is because self-attention computes relationships between distant patches, naturally encouraging the model to learn structural (shape) information rather than local texture patterns. Since shape is a more semantically meaningful cue than texture, ViTs generalize better to distribution shifts such as style changes, corruptions, and domain shifts. However, ViTs can be more vulnerable to adversarial patch perturbations, since attention allows a single crafted patch to influence every output token.

Question 17

How does DETR's transformer decoder generate object detections, and what role does cross-attention play?

Answer: DETR's decoder takes a fixed set of $N$ learnable object queries as input. These queries go through multiple decoder layers, each containing: (1) self-attention between queries (allowing them to communicate and avoid duplicate predictions), and (2) cross-attention where queries attend to the encoder's image features. Cross-attention is where each query "looks" at the image to find the object it should detect — it learns to attend to specific spatial regions and object parts. After all decoder layers, each query's output is independently passed through prediction heads (FFNs) to produce class probabilities and bounding box coordinates.

Question 18

What is ConvNeXt, and what does it tell us about the CNN vs. ViT debate?

Answer: ConvNeXt is a modernized CNN architecture that applies design principles from Vision Transformers (larger kernels, fewer normalization layers, GELU activation, inverted bottleneck, layer-wise scaling) while remaining a pure convolutional network. It achieves performance competitive with Swin Transformer across classification, detection, and segmentation tasks. This demonstrates that much of ViT's advantage comes from improved training recipes and architectural design choices rather than self-attention itself. It suggests the optimal vision architecture may combine ideas from both paradigms rather than being purely one or the other.

Question 19

Explain how SAM (Segment Anything Model) achieves zero-shot segmentation for arbitrary objects.

Answer: SAM uses three components: (1) A large pre-trained ViT-H image encoder that generates rich image embeddings (computed once per image). (2) A prompt encoder that encodes user-provided prompts — points, bounding boxes, or masks — into embeddings. (3) A lightweight mask decoder that combines image and prompt embeddings through cross-attention to predict segmentation masks. SAM achieves zero-shot generalization by being trained on over 1 billion masks spanning diverse objects and domains. The prompt-based architecture allows it to segment any object by specifying what to segment through prompts, without needing to be trained on specific object categories.

Question 20

How would you adapt a ViT pre-trained at 224x224 resolution for fine-tuning at 384x384 resolution?

Answer: The main challenge is that the number of patches changes (from 196 to 576 for patch size 16), so the pre-trained position embeddings have the wrong size. The solution is: (1) Keep all model weights except position embeddings. (2) Reshape the 1D position embeddings (excluding [CLS]) into a 2D grid matching the pre-training resolution (14x14). (3) Apply bicubic interpolation to upsample this 2D grid to the new resolution (24x24). (4) Flatten back to 1D and prepend the original [CLS] position embedding. This preserves the learned spatial structure while adapting to the new resolution. Fine-tuning at higher resolution typically improves accuracy by 1-3%.
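
A minimal sketch of this procedure, assuming a standard position-embedding tensor of shape (1, 197, D) with the [CLS] embedding first:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """Interpolate ViT position embeddings to a new patch grid (sketch).
    pos_embed: (1, 1 + old_grid**2, D) with the [CLS] embedding first."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)   # (1, D, 14, 14)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)                # (1, D, 24, 24)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pos, patch_pos], dim=1)                                 # (1, 577, D)

print(resize_pos_embed(torch.randn(1, 197, 768)).shape)  # torch.Size([1, 577, 768])
```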

Question 21

What is the computational bottleneck when applying standard ViT to high-resolution images, and what solutions exist?

Answer: The bottleneck is the quadratic complexity of self-attention: $O(N^2 d)$, where $N$ is the number of patches. Doubling the image resolution quadruples $N$ and increases the attention cost by 16x. Solutions include:
1. **Window attention** (Swin Transformer) — compute attention within local windows for $O(NM^2)$ cost.
2. **Spatial reduction** (SegFormer) — downsample keys/values before attention.
3. **Token pruning/merging** — reduce sequence length by removing or combining less important tokens.
4. **Linear attention** — approximate softmax attention with kernel methods for $O(Nd^2)$ cost.
5. **FlashAttention** — an IO-aware implementation that doesn't reduce FLOPs but dramatically improves wall-clock time and memory usage.

Question 22

Why does DETR struggle with detecting small objects, and how has this been addressed?

Answer: DETR struggles with small objects for two reasons: (1) The CNN backbone typically produces features at 1/32 resolution, which loses fine-grained spatial information needed for small objects. (2) Global self-attention in the encoder treats all spatial locations equally, making it difficult to focus on small regions. This has been addressed by: **Deformable DETR**, which uses deformable attention that attends to a small set of learned sampling points rather than all locations, enabling multi-scale feature processing and faster convergence. **DINO** (the detection model, not the self-supervised method) added contrastive denoising training and mixed query selection for further improvements.

Question 23

Compare layer-wise learning rate decay and full fine-tuning for ViT. When is each approach preferred?

Answer: **Layer-wise learning rate decay** assigns progressively smaller learning rates to earlier layers (e.g., multiplying by a factor of 0.65-0.75 per layer). This preserves the general features in early layers while allowing more adaptation in later layers. It is preferred when: (a) the target dataset is small, (b) the target domain is similar to the pre-training domain, or (c) overfitting is a concern. **Full fine-tuning** (uniform learning rate for all layers) allows maximum adaptation and is preferred when: (a) the target dataset is large, (b) the target domain is very different from pre-training, or (c) the model capacity is small relative to the dataset.
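
A small sketch of how layer-wise decay scales the learning rate per layer (the decay factor and base rate are illustrative; in practice these values become separate optimizer parameter groups):

```python
def layerwise_lr(base_lr, num_layers=12, decay=0.75):
    """Learning rate per transformer layer under layer-wise decay (sketch).
    Layer num_layers (closest to the head) gets base_lr; earlier layers get progressively less."""
    return {layer: base_lr * decay ** (num_layers - layer) for layer in range(num_layers + 1)}

lrs = layerwise_lr(1e-3)
print(f"{lrs[12]:.2e}  {lrs[6]:.2e}  {lrs[0]:.2e}")   # 1.00e-03  1.78e-04  3.17e-05
```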

Question 24

What is DINO's self-distillation approach, and what surprising properties do the learned features exhibit?

Answer: DINO trains a student ViT to match the output of a teacher ViT (an exponential moving average of the student) on different augmented views of the same image, using a cross-entropy loss on centered, sharpened softmax outputs. No labels are used. The surprising properties of DINO features include: (1) Attention maps in the last layer produce accurate object segmentation without any segmentation supervision. (2) The [CLS] token features work remarkably well for k-NN classification (78.3% on ImageNet without any fine-tuning). (3) The features naturally encode semantic structure that enables effective image retrieval. These properties emerge purely from the self-supervised training objective.
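
A minimal sketch of the two core pieces, the EMA teacher update and the centered, sharpened cross-entropy, assuming the student and teacher share the same architecture (momentum and temperatures are illustrative):

```python
import torch

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """Teacher = exponential moving average of the student (sketch)."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1 - momentum)

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student (sketch)."""
    t = torch.softmax((teacher_out - center) / tau_t, dim=-1)   # centering + low temperature = sharpening
    log_s = torch.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()
```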

Question 25

Describe the architectural differences between ViT-Base, Swin-Base, and SegFormer-B4. How does each design choice reflect the intended use case?

Answer:
- **ViT-Base**: Single-scale, global self-attention over all patches. 12 layers, 768-dim, 86M params. Designed for classification — produces a single [CLS] representation. Simple but limited to moderate resolutions.
- **Swin-Base**: Hierarchical, shifted-window attention. 4 stages with patch merging, producing feature maps at 1/4, 1/8, 1/16, and 1/32 resolution. 128-dim base, 88M params. Designed as a general-purpose backbone — multi-scale features support detection and segmentation via FPN.
- **SegFormer-B4**: Hierarchical with overlapping patch embeddings and Mix-FFN (a depthwise convolution in the FFN). Efficient self-attention via spatial reduction of keys/values. Designed for dense prediction — paired with a lightweight MLP decoder. Overlapping patches and Mix-FFN provide positional awareness without explicit position embeddings.

Each design reflects its use case: ViT-Base for global understanding (classification), Swin-Base for general-purpose feature extraction, and SegFormer-B4 for efficient dense prediction.