Case Study 2: Object Detection with DETR
Overview
Object detection -- the task of identifying and localizing objects within images -- has traditionally relied on complex, multi-stage pipelines with hand-designed components such as anchor boxes, region proposal networks, and non-maximum suppression. In this case study, you will learn how to use DETR (DEtection TRansformer) to build a streamlined, end-to-end object detection system for a custom dataset, demonstrating the elegance and practical effectiveness of transformer-based detection.
Problem Statement
You are building an automated inventory monitoring system for a retail warehouse. The system must detect and localize products on warehouse shelves from overhead camera images, identifying objects from 8 product categories: boxes, bottles, cans, bags, jars, tubes, cartons, and pouches. The dataset contains approximately 3,200 annotated images with bounding boxes in COCO format.
Dataset
The dataset follows the COCO annotation format:
```
warehouse_detection/
    train/
        images/           (~2,400 images)
        annotations.json
    val/
        images/           (~400 images)
        annotations.json
    test/
        images/           (~400 images)
        annotations.json
```
Each annotation file contains bounding boxes in [x_min, y_min, width, height] format, category IDs, and image metadata.
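The snippet below peeks at that structure; it is a minimal sketch assuming the directory layout above and uses only the standard library.

```python
import json

# Load the training annotations (path assumed from the layout above).
with open("warehouse_detection/train/annotations.json") as f:
    coco = json.load(f)

# Standard COCO structure: "images", "annotations", and "categories" lists.
print(coco["images"][0])       # e.g. {'id': ..., 'file_name': ..., 'width': 640, 'height': 480}
print(coco["annotations"][0])  # e.g. {'image_id': ..., 'category_id': ..., 'bbox': [x_min, y_min, w, h]}
print(coco["categories"])      # 8 entries: boxes, bottles, cans, ...
```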
Approach
Step 1: Data Analysis
Before training, a thorough data analysis reveals key characteristics (a sketch of how to compute them follows this list):
- Object density: Average of 12 objects per image, with some images containing up to 40.
- Size distribution: Objects range from 32x32 pixels (small cans) to 200x150 pixels (large boxes) in 640x480 images.
- Class imbalance: Boxes and bottles together account for nearly half of all annotations; tubes and pouches together account for under 9%.
- Occlusion: Approximately 20% of objects are partially occluded by neighboring products.
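A minimal sketch of computing these statistics from the COCO annotation file (the path is an assumption based on the layout above):

```python
import json
from collections import Counter

with open("warehouse_detection/train/annotations.json") as f:
    coco = json.load(f)

# Objects per image: mean and maximum density.
per_image = Counter(ann["image_id"] for ann in coco["annotations"])
counts = list(per_image.values())
print(f"mean objects/image: {sum(counts) / len(counts):.1f}, max: {max(counts)}")

# Class balance: annotation counts per category.
id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
print(Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"]).most_common())

# Size distribution: longer side of each box ([x_min, y_min, w, h] format).
sizes = [max(ann["bbox"][2], ann["bbox"][3]) for ann in coco["annotations"]]
print(f"smallest: {min(sizes):.0f}px, largest: {max(sizes):.0f}px")
```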
Step 2: Model Selection
We choose DETR with a ResNet-50 backbone, initialized from the publicly released COCO-pre-trained DETR checkpoint. Key configuration decisions (one way to instantiate them is sketched after this list):
- Number of object queries: 50 (sufficient for our maximum object count of ~40 per image).
- Backbone: ResNet-50 (ImageNet pre-trained before DETR's COCO detection training), with the final stage producing features at 1/32 resolution.
- Transformer: 6 encoder layers, 6 decoder layers, 256 hidden dimension, 8 attention heads.
- Image resolution: 640x640 (resized with padding to maintain aspect ratio).
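One way to realize this configuration is with the Hugging Face transformers implementation of DETR; the checkpoint name and processor settings below are assumptions, not the case study's exact code.

```python
from transformers import DetrForObjectDetection, DetrImageProcessor

# Load the COCO-pre-trained DETR and adapt it to our 8 classes and 50 queries.
# Changing num_queries re-initializes the learned query embeddings;
# ignore_mismatched_sizes permits the resulting shape changes.
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=8,
    num_queries=50,
    ignore_mismatched_sizes=True,
)

# Resize images to at most 640 pixels per side; batching pads to a common size.
processor = DetrImageProcessor.from_pretrained(
    "facebook/detr-resnet-50",
    size={"shortest_edge": 640, "longest_edge": 640},
)
```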
Step 3: Training Strategy
DETR is known for slow convergence. We address this with a carefully designed training recipe (the optimizer and scheduler setup is sketched after this list):
- Pre-trained initialization: Start from the COCO-pre-trained DETR checkpoint rather than training from scratch.
- Learning rate: 1e-5 for the backbone, 1e-4 for the transformer and prediction heads.
- Optimizer: AdamW with weight decay 1e-4.
- Scheduler: Step LR with decay factor 0.1 at epoch 40.
- Epochs: 50 (reduced from the standard 500 due to pre-trained initialization).
- Batch size: 4 per GPU (DETR is memory-intensive due to global self-attention over high-resolution feature maps).
- Data augmentation: Random horizontal flip, random resize (scales 0.8-1.2), color jitter.
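A sketch of the optimizer and scheduler setup under this recipe, assuming the `model` object from the previous sketch; the parameter-name match on "backbone" is a heuristic that holds for common DETR implementations.

```python
import torch

# Split parameters: lower LR for the CNN backbone, higher LR for the
# transformer and prediction heads.
backbone = [p for n, p in model.named_parameters() if "backbone" in n and p.requires_grad]
rest = [p for n, p in model.named_parameters() if "backbone" not in n and p.requires_grad]

optimizer = torch.optim.AdamW(
    [{"params": backbone, "lr": 1e-5}, {"params": rest, "lr": 1e-4}],
    weight_decay=1e-4,
)

# Decay both learning rates by 10x at epoch 40 of the 50-epoch schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)
```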
Step 4: Loss Configuration
DETR uses Hungarian (bipartite) matching to assign each ground-truth object to exactly one prediction; an illustrative matcher is sketched after this list. The loss combines:
- Classification loss: Cross-entropy with class weight 1.0 for objects and 0.1 for the "no object" class.
- L1 box loss: Absolute difference between predicted and ground truth bounding boxes, weighted by 5.0.
- GIoU loss: Generalized Intersection over Union for scale-invariant box regression, weighted by 2.0.
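To make the matching step concrete, here is an illustrative single-image Hungarian matcher using the cost weights above (class 1.0, L1 5.0, GIoU 2.0). It mirrors the reference DETR matcher but is a simplified sketch, not the case study's code.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou

def box_cxcywh_to_xyxy(b):
    # Convert normalized [cx, cy, w, h] boxes to [x_min, y_min, x_max, y_max].
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (Q, C+1); pred_boxes, gt_boxes: normalized [cx, cy, w, h].
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                                   # (Q, N)
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)                   # (Q, N)
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(pred_boxes),
                                     box_cxcywh_to_xyxy(gt_boxes))     # (Q, N)
    cost = 1.0 * cost_class + 5.0 * cost_l1 + 2.0 * cost_giou
    # Optimal one-to-one assignment of predictions to ground-truth objects.
    return linear_sum_assignment(cost.detach().cpu().numpy())
```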
Step 5: Inference and Post-Processing
At inference time, DETR produces a fixed set of 50 predictions per image. Post-processing (sketched in code after this list) involves:
- Filter predictions by confidence threshold (0.7 for high-precision applications).
- No NMS needed: object queries naturally specialize to distinct objects, so duplicate predictions are rare.
- Convert box coordinates from normalized [cx, cy, w, h] to absolute [x_min, y_min, x_max, y_max].
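A minimal post-processing sketch implementing these steps (a simplified version of what detection libraries typically ship as a post-processor):

```python
import torch

def postprocess(pred_logits, pred_boxes, img_w, img_h, threshold=0.7):
    # pred_logits: (50, num_classes+1); pred_boxes: (50, 4) normalized [cx, cy, w, h].
    probs = pred_logits.softmax(-1)[:, :-1]   # drop the trailing "no object" class
    scores, labels = probs.max(-1)
    keep = scores > threshold                 # confidence filter only; no NMS

    cx, cy, w, h = pred_boxes[keep].unbind(-1)
    boxes = torch.stack(
        [(cx - w / 2) * img_w, (cy - h / 2) * img_h,   # x_min, y_min
         (cx + w / 2) * img_w, (cy + h / 2) * img_h],  # x_max, y_max
        dim=-1,
    )
    return boxes, labels[keep], scores[keep]
```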
Results
Detection Performance
| Metric | Value |
|---|---|
| mAP@0.5 | 0.78 |
| mAP@0.75 | 0.61 |
| mAP@[0.5:0.95] | 0.52 |
| AR@10 | 0.68 |
| AR@50 | 0.71 |
Per-Class AP@0.5
| Category | AP@0.5 | Count in Dataset |
|---|---|---|
| Boxes | 0.86 | 8,400 |
| Bottles | 0.82 | 6,200 |
| Cans | 0.79 | 4,100 |
| Bags | 0.77 | 3,600 |
| Jars | 0.74 | 2,800 |
| Cartons | 0.73 | 2,300 |
| Tubes | 0.68 | 1,500 |
| Pouches | 0.65 | 1,100 |
Object Size Analysis
| Size Category | mAP@0.5 |
|---|---|
| Small (<48px) | 0.42 |
| Medium (48-128px) | 0.74 |
| Large (>128px) | 0.89 |
DETR's known weakness with small objects is apparent here. The global self-attention mechanism excels at large objects but struggles with fine-grained localization of small ones.
Object Query Specialization
Visualizing the attention patterns of individual object queries reveals that they naturally specialize:
- Queries 1-10 tend to detect objects in the top-left quadrant.
- Queries 11-20 focus on the center of the image.
- Queries 30-40 specialize in large objects regardless of position.
- Queries 41-50 rarely activate (unused capacity).
This emergent specialization mirrors the behavior reported in the original DETR paper.
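With the Hugging Face implementation, per-query cross-attention maps can be obtained as sketched below; `model` and `pixel_values` are assumed from earlier sketches, and the tensor shapes follow the library's documented outputs.

```python
import torch

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, output_attentions=True)

# Last decoder layer's cross-attention:
# (batch, num_heads, num_queries, num_feature_positions).
cross_attn = outputs.cross_attentions[-1]

query_id = 7                                      # inspect one object query
attn_map = cross_attn[0, :, query_id, :].mean(0)  # average over attention heads
# Reshape attn_map to the 1/32-resolution feature grid and upsample it
# over the input image to visualize where this query looks.
```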
Key Lessons
- Pre-training on COCO is essential. Training DETR from scratch required 300+ epochs on our dataset and still achieved 15% lower mAP than fine-tuning from the COCO checkpoint for only 50 epochs.
- Small object detection remains challenging. DETR's global attention operates on 1/32-resolution features and loses fine spatial detail. For production use, we recommend Deformable DETR or combining DETR with a multi-scale feature pyramid for small object detection.
- The number of queries must exceed the maximum object count. Setting queries to 30 (below our maximum of ~40) caused the model to miss objects in dense scenes; setting queries to 100 wasted compute without improving accuracy.
- No NMS simplifies the pipeline. Eliminating non-maximum suppression removed a source of hyperparameter tuning and edge-case failures, making the system more robust in production.
- Class imbalance affects detection quality. The roughly 8x imbalance between the most common class (boxes, 8,400 annotations) and the least common (pouches, 1,100) produced a 21-point AP@0.5 gap. Oversampling rare classes during training reduced the gap to 14 points (a minimal sampler sketch follows this list).
- DETR inference is slower than anchor-based detectors. At 28 FPS on an A100 GPU (versus 45 FPS for Faster R-CNN), the transformer decoder adds latency. For real-time applications, consider DINO or RT-DETR variants.
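For the oversampling mentioned above, one simple option is PyTorch's WeightedRandomSampler; `dataset`, `collate_fn`, and the precomputed per-image `rare_class_weight` are hypothetical placeholders, not names from the case study code.

```python
from torch.utils.data import DataLoader, WeightedRandomSampler

# rare_class_weight[i]: higher weight for images containing rare classes
# (e.g., inverse frequency of the rarest class present). Hypothetical input.
weights = [rare_class_weight[i] for i in range(len(dataset))]
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler, collate_fn=collate_fn)
```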
Comparison with Faster R-CNN
To contextualize the results, we trained a Faster R-CNN (ResNet-50 + FPN) baseline on the same dataset:
| Metric | DETR | Faster R-CNN |
|---|---|---|
| mAP@0.5 | 0.78 | 0.76 |
| mAP@[0.5:0.95] | 0.52 | 0.49 |
| Small object AP | 0.42 | 0.51 |
| Large object AP | 0.89 | 0.82 |
| Inference FPS | 28 | 45 |
| NMS required | No | Yes |
| Duplicate detections | Rare | Frequent without NMS |
DETR excels at large objects and produces cleaner predictions (no duplicates), while Faster R-CNN is faster and better at small objects.
Production Deployment Considerations
- Latency: For real-time monitoring at 30 FPS, consider RT-DETR or model distillation.
- Batch processing: DETR benefits from batched inference; process multiple camera frames together.
- Confidence calibration: DETR confidence scores tend to be overconfident. Apply temperature scaling fit on the validation set (a minimal sketch follows this list).
- Model updates: As new product categories are introduced, fine-tune with the existing model as initialization rather than retraining from scratch.
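A minimal temperature-scaling sketch, assuming `val_logits` and `val_labels` have been collected from matched predictions on the validation set:

```python
import torch
import torch.nn.functional as F

# Fit a single scalar temperature T so that softmax(logits / T) is better
# calibrated; only T is optimized, the model stays frozen.
T = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.LBFGS([T], lr=0.1, max_iter=50)

def closure():
    opt.zero_grad()
    loss = F.cross_entropy(val_logits / T, val_labels)  # val_* are assumptions
    loss.backward()
    return loss

opt.step(closure)
# At inference, divide the class logits by T before the softmax.
```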
Code Reference
The complete implementation is available in code/case-study-code.py.