Case Study 2: Object Detection with DETR

Overview

Object detection -- the task of identifying and localizing objects within images -- has traditionally relied on complex, multi-stage pipelines with hand-designed components such as anchor boxes, region proposal networks, and non-maximum suppression. In this case study, you will learn how to use DETR (DEtection TRansformer) to build a streamlined, end-to-end object detection system for a custom dataset, demonstrating the elegance and practical effectiveness of transformer-based detection.

Problem Statement

You are building an automated inventory monitoring system for a retail warehouse. The system must detect and localize products on warehouse shelves from overhead camera images, identifying objects from 8 product categories: boxes, bottles, cans, bags, jars, tubes, cartons, and pouches. The dataset contains approximately 3,200 annotated images with bounding boxes in COCO format.

Dataset

The dataset follows the COCO annotation format:

warehouse_detection/
    train/
        images/        (~2,400 images)
        annotations.json
    val/
        images/        (~400 images)
        annotations.json
    test/
        images/        (~400 images)
        annotations.json

Each annotation file contains bounding boxes in [x_min, y_min, width, height] format, category IDs, and image metadata.
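
As a minimal sketch, these files can be parsed with Python's standard json module (the path and variable names are illustrative; the field names are standard COCO):

    import json
    from collections import defaultdict

    # Load the COCO-format annotation file (path is illustrative).
    with open("warehouse_detection/train/annotations.json") as f:
        coco = json.load(f)

    # Group annotations by the image they belong to.
    anns_by_image = defaultdict(list)
    for ann in coco["annotations"]:
        anns_by_image[ann["image_id"]].append(ann)

    # Each annotation stores its box as [x_min, y_min, width, height]
    # plus a category id that indexes into coco["categories"].
    first_image = coco["images"][0]
    for ann in anns_by_image[first_image["id"]]:
        x_min, y_min, w, h = ann["bbox"]
        print(ann["category_id"], x_min, y_min, w, h)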

Approach

Step 1: Data Analysis

Before training, a thorough analysis of the annotations reveals key characteristics (computed in the sketch after this list):

  • Object density: Average of 12 objects per image, with some images containing up to 40.
  • Size distribution: Objects range from 32x32 pixels (small cans) to 200x150 pixels (large boxes) in 640x480 images.
  • Class imbalance: Boxes and bottles together account for nearly half of all annotations; tubes and pouches together account for under 9%.
  • Occlusion: Approximately 20% of objects are partially occluded by neighboring products.
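
These statistics can be computed directly from the annotation file, assuming the coco dict loaded in the previous snippet:

    from collections import Counter

    # Object density: annotations per image.
    per_image = Counter(ann["image_id"] for ann in coco["annotations"])
    print("mean objects/image:", sum(per_image.values()) / len(per_image))
    print("max objects/image:", max(per_image.values()))

    # Class balance: annotation counts per category name.
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    per_class = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    print(per_class.most_common())

    # Size distribution: bbox is [x_min, y_min, width, height].
    widths = sorted(a["bbox"][2] for a in coco["annotations"])
    print("smallest / largest box width:", widths[0], widths[-1])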

Step 2: Model Selection

We choose DETR with a ResNet-50 backbone, initialized from the official COCO-pre-trained checkpoint. Key configuration decisions (model construction is sketched after this list):

  • Number of object queries: 50 (sufficient for our maximum object count of ~40 per image).
  • Backbone: ResNet-50 (originally ImageNet pre-trained), with the final stage producing features at 1/32 resolution.
  • Transformer: 6 encoder layers, 6 decoder layers, 256 hidden dimension, 8 attention heads.
  • Image resolution: 640x640 (resized with padding to maintain aspect ratio).
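
One way to realize this configuration is the Hugging Face transformers implementation of DETR; the following is a sketch under that assumption (the case study code may construct the model differently):

    from transformers import DetrForObjectDetection

    # Load the COCO-pre-trained DETR (ResNet-50) checkpoint, replacing the
    # classification head for our 8 categories and shrinking the query set.
    model = DetrForObjectDetection.from_pretrained(
        "facebook/detr-resnet-50",
        num_labels=8,                   # 8 product categories
        num_queries=50,                 # 50 object queries (default is 100)
        ignore_mismatched_sizes=True,   # re-initialize head and query embeddings
    )

Note that shrinking num_queries re-initializes the learned query embeddings, which are then learned anew during fine-tuning.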

Step 3: Training Strategy

DETR is known for slow convergence. We address this with a carefully designed training recipe (the optimizer setup is sketched after this list):

  • Pre-trained initialization: Start from the COCO-pre-trained DETR checkpoint rather than training from scratch.
  • Learning rate: 1e-5 for the backbone, 1e-4 for the transformer and prediction heads.
  • Optimizer: AdamW with weight decay 1e-4.
  • Scheduler: Step LR with decay factor 0.1 at epoch 40.
  • Epochs: 50 (reduced from the standard 500 due to pre-trained initialization).
  • Batch size: 4 per GPU (DETR is memory-intensive, largely due to self-attention over the full image feature map).
  • Data augmentation: Random horizontal flip, random resize (scales 0.8-1.2), color jitter.
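
A sketch of the corresponding optimizer and scheduler setup in PyTorch, assuming the model object from the previous snippet (whose backbone parameters contain "backbone" in their names):

    import torch

    # Lower learning rate for the pre-trained backbone, higher for the
    # transformer and prediction heads.
    backbone = [p for n, p in model.named_parameters()
                if "backbone" in n and p.requires_grad]
    rest = [p for n, p in model.named_parameters()
            if "backbone" not in n and p.requires_grad]

    optimizer = torch.optim.AdamW(
        [{"params": backbone, "lr": 1e-5},
         {"params": rest, "lr": 1e-4}],
        weight_decay=1e-4,
    )

    # Decay both learning rates by 10x at epoch 40 (of 50).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)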

Step 4: Loss Configuration

DETR uses Hungarian matching to assign each prediction to at most one ground truth object; unmatched predictions are supervised as "no object". The loss (matching sketched after this list) combines:

  • Classification loss: Cross-entropy with class weight 1.0 for objects and 0.1 for the "no object" class.
  • L1 box loss: Absolute difference between predicted and ground truth bounding boxes, weighted by 5.0.
  • GIoU loss: Generalized Intersection over Union for scale-invariant box regression, weighted by 2.0.
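
The essence of the matching step, condensed into a sketch for a single image using scipy's Hungarian solver. For brevity, all box costs are computed in [x_min, y_min, x_max, y_max] format, whereas the reference DETR matcher uses normalized [cx, cy, w, h] for the L1 term:

    import torch
    from scipy.optimize import linear_sum_assignment
    from torchvision.ops import generalized_box_iou

    def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
        """One-to-one assignment of queries to ground-truth objects.

        pred_probs: [50, num_classes + 1] softmax scores
        gt_labels:  [num_gt] tensor of class ids
        pred_boxes, gt_boxes: boxes in [x_min, y_min, x_max, y_max] format
        """
        cost_class = -pred_probs[:, gt_labels]                  # [50, num_gt]
        cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)        # L1 box cost
        cost_giou = -generalized_box_iou(pred_boxes, gt_boxes)  # higher GIoU = lower cost
        # Same weights as the loss terms: 1.0 class, 5.0 L1, 2.0 GIoU.
        cost = 1.0 * cost_class + 5.0 * cost_l1 + 2.0 * cost_giou
        pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
        return pred_idx, gt_idx  # unmatched queries are supervised as "no object"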

Step 5: Inference and Post-Processing

At inference time, DETR produces a fixed set of 50 predictions per image, one per object query. Post-processing (sketched after the list) involves:

  1. Filter predictions by confidence threshold (0.7 for high-precision applications).
  2. No NMS needed -- set-based training with one-to-one matching suppresses duplicate predictions, so each query claims a distinct object.
  3. Convert box coordinates from normalized [cx, cy, w, h] to absolute [x_min, y_min, x_max, y_max].
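
A sketch of this post-processing for a single image's raw outputs (logits and normalized boxes as produced by DETR; torchvision's box_convert handles the coordinate change):

    import torch
    from torchvision.ops import box_convert

    def postprocess(logits, boxes, img_w, img_h, threshold=0.7):
        """Turn one image's raw DETR outputs into final detections.

        logits: [50, num_classes + 1], last column is "no object"
        boxes:  [50, 4] normalized [cx, cy, w, h]
        """
        probs = logits.softmax(-1)[:, :-1]   # drop the "no object" column
        scores, labels = probs.max(-1)       # best class per query
        keep = scores > threshold            # step 1: confidence filter (no NMS)
        # Step 3: normalized [cx, cy, w, h] -> absolute [x_min, y_min, x_max, y_max].
        xyxy = box_convert(boxes[keep], in_fmt="cxcywh", out_fmt="xyxy")
        xyxy = xyxy * torch.tensor([img_w, img_h, img_w, img_h], dtype=xyxy.dtype)
        return xyxy, labels[keep], scores[keep]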

Results

Detection Performance

Metric            Value
mAP@0.5           0.78
mAP@0.75          0.61
mAP@[0.5:0.95]    0.52
AR@10             0.68
AR@50             0.71

Per-Class AP@0.5

Category    AP@0.5    Count in Dataset
Boxes       0.86      8,400
Bottles     0.82      6,200
Cans        0.79      4,100
Bags        0.77      3,600
Jars        0.74      2,800
Cartons     0.73      2,300
Tubes       0.68      1,500
Pouches     0.65      1,100

Object Size Analysis

Size Category        mAP@0.5
Small (<48px)        0.42
Medium (48-128px)    0.74
Large (>128px)       0.89

DETR's known weakness with small objects is apparent here. The global self-attention mechanism excels at large objects but struggles with fine-grained localization of small ones.

Object Query Specialization

Visualizing the attention patterns of individual object queries reveals that they naturally specialize:

  • Queries 1-10 tend to detect objects in the top-left quadrant.
  • Queries 11-20 focus on the center of the image.
  • Queries 30-40 specialize in large objects regardless of position.
  • Queries 41-50 rarely activate (unused capacity).

This emergent specialization mirrors the behavior reported in the original DETR paper.

Key Lessons

  1. Pre-training on COCO is essential. Training DETR from scratch required 300+ epochs on our dataset and achieved 15% lower mAP compared to fine-tuning from the COCO checkpoint in only 50 epochs.

  2. Small object detection remains challenging. DETR's global attention at 1/32 resolution loses fine spatial detail. For production use, we recommend Deformable DETR or combining DETR with a multi-scale feature pyramid for small object detection.

  3. The number of queries must exceed the maximum object count. Setting queries to 30 (below our maximum of ~40) caused the model to miss objects in dense scenes. Setting queries to 100 wasted compute without improving accuracy.

  4. No NMS simplifies the pipeline. Eliminating non-maximum suppression removed a source of hyperparameter tuning and edge-case failures, making the system more robust in production.

  5. Class imbalance affects detection quality. The nearly 8x imbalance between the most common class (8,400 boxes) and the least common (1,100 pouches) resulted in a 21-point AP gap. Oversampling rare classes during training reduced this gap to 14 points (see the sampling sketch after this list).

  6. DETR inference is slower than anchor-based detectors. At 28 FPS on an A100 GPU (versus 45 FPS for Faster R-CNN), DETR's transformer decoder adds latency. For real-time applications, consider DINO or RT-DETR variants.
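
For lesson 5, an illustrative oversampling sketch using PyTorch's WeightedRandomSampler; image_labels (a list of per-image category-id lists derived from the annotations) is a hypothetical input:

    from collections import Counter
    from torch.utils.data import WeightedRandomSampler

    # image_labels: hypothetical list with one list of category ids per image.
    class_counts = Counter(c for labels in image_labels for c in labels)

    # Weight each image by the rarity of its rarest class, so images with
    # tubes or pouches are drawn more often.
    weights = [max(1.0 / class_counts[c] for c in labels) if labels else 1e-6
               for labels in image_labels]

    sampler = WeightedRandomSampler(weights, num_samples=len(weights),
                                    replacement=True)
    # Pass sampler=sampler to the training DataLoader (and omit shuffle=True).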

Comparison with Faster R-CNN

To contextualize the results, we trained a Faster R-CNN (ResNet-50 + FPN) baseline on the same dataset:

Metric                  DETR    Faster R-CNN
mAP@0.5                 0.78    0.76
mAP@[0.5:0.95]          0.52    0.49
Small object AP         0.42    0.51
Large object AP         0.89    0.82
Inference FPS           28      45
NMS required            No      Yes
Duplicate detections    Rare    Frequent without NMS

DETR excels at large objects and produces cleaner predictions (no duplicates), while Faster R-CNN is faster and better at small objects.

Production Deployment Considerations

  • Latency: For real-time monitoring at 30 FPS, consider RT-DETR or model distillation.
  • Batch processing: DETR benefits from batched inference; process multiple camera frames together.
  • Confidence calibration: DETR confidence scores tend to be overconfident. Apply temperature scaling fitted on the validation set (sketched below).
  • Model updates: As new product categories are introduced, fine-tune with the existing model as initialization rather than retraining from scratch.
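
For the calibration point, a minimal temperature-scaling sketch; val_logits and val_labels (logits of matched validation predictions and their true classes) are hypothetical inputs:

    import torch
    import torch.nn.functional as F

    def fit_temperature(val_logits, val_labels):
        """Fit a single scalar temperature T on held-out validation data."""
        log_t = torch.zeros(1, requires_grad=True)
        optimizer = torch.optim.LBFGS([log_t], max_iter=100)

        def closure():
            optimizer.zero_grad()
            loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
            loss.backward()
            return loss

        optimizer.step(closure)
        return log_t.exp().item()

    # At inference, divide logits by T before the softmax to calibrate scores.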

Code Reference

The complete implementation is available in code/case-study-code.py.