Case Study 1: Scaling Training with DDP

Overview

A computer vision team needs to train a ViT-Large model on an internal dataset of 2 million images for 50 epochs. Single-GPU training takes 12 days on an A100. The goal is to reduce training time to under 2 days using data parallelism while maintaining model accuracy within 0.3% of the single-GPU baseline.

Problem Statement

The team faces several practical challenges:

  1. Training time: 12 days is too slow for iterative experimentation.
  2. Batch size sensitivity: ViT-Large is sensitive to batch size; increasing the batch size without a matching learning rate adjustment degrades accuracy.
  3. Infrastructure: The team has access to a cluster with 8 A100 GPUs across 2 nodes (4 GPUs per node).

Approach

Step 1: Baseline Measurement

Single-GPU baseline on one A100-80GB:

  • Batch size: 64
  • Learning rate: 1e-4 (AdamW)
  • Training throughput: 150 images/second
  • Peak accuracy: 87.2%
  • Training time: 12.1 days
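Throughput numbers like these are only comparable if every configuration measures them the same way. The sketch below shows one way to do it with a standard PyTorch training loop; the `measure_throughput` helper and the warmup/timing window sizes are illustrative choices, not part of the team's tooling.

```python
import time
import torch

def measure_throughput(model, loader, optimizer, device,
                       warmup_steps=10, timed_steps=50):
    """Estimate training throughput (images/second) over a fixed window of steps."""
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    images_seen, start = 0, None
    for step, (images, labels) in enumerate(loader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        if step == warmup_steps:
            torch.cuda.synchronize()      # exclude warmup iterations from the timed window
            start = time.time()
        elif step > warmup_steps:
            images_seen += images.size(0)
            if step == warmup_steps + timed_steps:
                torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
                return images_seen / (time.time() - start)
```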

Step 2: DDP Configuration

The team uses PyTorch DistributedDataParallel with:

  • Backend: NCCL
  • 8 GPUs across 2 nodes (4 per node, connected by InfiniBand)
  • DistributedSampler for data sharding
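A minimal sketch of this configuration is below, assuming the processes are launched with torchrun (one process per GPU). The `setup_ddp` helper, the batch size default, and the worker count are illustrative, not taken from the team's code.

```python
# Launched once per node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=4 \
#            --rdzv_backend=c10d --rdzv_endpoint=<master_host>:29500 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_ddp(model, dataset, per_gpu_batch_size=64):
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each of the 8 processes holds a full model replica; DDP all-reduces
    # gradients across them during the backward pass
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    # DistributedSampler gives every rank a disjoint shard of the dataset
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=per_gpu_batch_size,
                        sampler=sampler, num_workers=8, pin_memory=True)
    return model, loader, sampler
```

One easy-to-miss detail: `sampler.set_epoch(epoch)` should be called at the start of every epoch so that the shuffle order differs between epochs.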

Step 3: Batch Size and Learning Rate Scaling

With 8 GPUs, the effective batch size becomes 64 * 8 = 512. The learning rate is scaled using the linear rule with warmup:

  • Base LR: 1e-4 (for batch size 64)
  • Scaled LR: 8e-4 (for batch size 512)
  • Warmup: 5 epochs (linear warmup from 1e-5 to 8e-4)
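The rule and warmup translate into a few lines of scheduler code. The sketch below assumes a per-epoch `LambdaLR`; holding the scaled LR constant after warmup is a simplification (the case study does not specify the decay policy), and `model` comes from the setup above.

```python
import torch

base_lr = 1e-4                       # tuned for batch size 64
world_size = 8
scaled_lr = base_lr * world_size     # linear scaling rule -> 8e-4 for batch size 512
warmup_start_lr = 1e-5
warmup_epochs = 5

optimizer = torch.optim.AdamW(model.parameters(), lr=scaled_lr)

def lr_lambda(epoch):
    """Multiplicative factor applied to scaled_lr at each epoch."""
    if epoch < warmup_epochs:
        # linear warmup from 1e-5 to 8e-4 over the first 5 epochs
        start = warmup_start_lr / scaled_lr
        return start + (1.0 - start) * epoch / warmup_epochs
    return 1.0  # hold the scaled LR after warmup (decay policy left unspecified)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# scheduler.step() is called once per epoch
```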

Step 4: Communication Optimization

  • Gradient bucketing: 25MB buckets to overlap communication with computation
  • FP16 gradient reduction: Reduces communication volume by 2x
  • Process group: Separate groups for intra-node and inter-node communication
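These three settings map onto the PyTorch DDP API roughly as sketched below. Here `model` is the unwrapped module and `local_rank` follows from Step 2 (this wrap replaces the plain one shown there); the rank-to-node layout (ranks 0-3 on node 0, ranks 4-7 on node 1) is an assumption about the launch configuration.

```python
import torch.distributed as dist
import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default
from torch.nn.parallel import DistributedDataParallel as DDP

# Gradient bucketing: the all-reduce for a 25 MB bucket of gradients starts as
# soon as the bucket is ready, overlapping communication with the rest of backward
model = DDP(model.cuda(local_rank), device_ids=[local_rank], bucket_cap_mb=25)

# FP16 gradient reduction: gradients are cast to half precision before the
# all-reduce and back to FP32 afterwards, halving communication volume
model.register_comm_hook(state=None, hook=default.fp16_compress_hook)

# Separate process groups for intra-node collectives; new_group must be
# called on every rank with identical arguments, even by ranks not in the group
node0 = dist.new_group(ranks=[0, 1, 2, 3])
node1 = dist.new_group(ranks=[4, 5, 6, 7])
intra_node_group = node0 if dist.get_rank() < 4 else node1
```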

Results

Configuration        GPUs   Batch Size    LR     Throughput    Time       Accuracy
Baseline             1      64            1e-4   150 img/s     12.1 days  87.2%
DDP (no LR scaling)  8      512           1e-4   1,080 img/s   1.7 days   84.8%
DDP + linear LR      8      512           8e-4   1,080 img/s   1.7 days   86.5%
DDP + warmup         8      512           8e-4   1,080 img/s   1.7 days   87.0%
DDP + grad accum     8      512 (4*128)   8e-4   1,020 img/s   1.8 days   87.1%

Scaling efficiency: 1,080 / (150 * 8) = 1,080 / 1,200 = 90%

Key Lessons

  1. Linear learning rate scaling is essential. Without adjusting the learning rate, accuracy dropped by 2.4%. The linear scaling rule with warmup recovered nearly all of the accuracy gap.

  2. Warmup is critical for large batch training. The first few epochs with a high learning rate caused training instability. A 5-epoch warmup stabilized training and improved final accuracy by 0.5%.

  3. 90% scaling efficiency is achievable with NCCL. The 10% overhead comes from gradient synchronization (7%) and data loading overhead (3%).

  4. Gradient accumulation provides a small accuracy benefit. Using gradient accumulation (4 micro-batches of 128 instead of one batch of 512) slightly improved accuracy at a modest throughput cost. Since the effective batch size and the number of optimizer updates per epoch are unchanged, the 0.1% gain is likely within run-to-run variance rather than a systematic effect. A sketch of accumulation under DDP follows this list.

  5. Inter-node communication is the bottleneck. Profiling showed that 65% of communication time was inter-node (InfiniBand) while 35% was intra-node (NVLink). Gradient compression could further improve inter-node efficiency.
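For the gradient accumulation in lesson 4, a minimal sketch under DDP is shown below. It uses `no_sync()` so the gradient all-reduce happens only on the last micro-batch of each accumulation window; the loop structure and helper names are illustrative, and the DataLoader is assumed to yield per-GPU micro-batches.

```python
import contextlib

accum_steps = 4  # 4 micro-batches are summed into each optimizer update

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (images, labels) in enumerate(loader):
        images, labels = images.to(device), labels.to(device)
        is_update_step = (step + 1) % accum_steps == 0
        # no_sync() skips the gradient all-reduce on intermediate micro-batches,
        # so communication happens only once per accumulation window
        ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
        with ctx:
            loss = criterion(model(images), labels) / accum_steps  # average over the window
            loss.backward()
        if is_update_step:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```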
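For the gradient compression suggested in lesson 5, one option in PyTorch is the PowerSGD communication hook, which applies low-rank compression to gradients before the all-reduce. The sketch below shows the registration only; the rank and start-iteration values are illustrative, and since a DDP model accepts a single communication hook, this would replace the FP16 hook used in Step 4 rather than stack on top of it.

```python
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

# Low-rank gradient compression before the all-reduce; a higher
# matrix_approximation_rank trades compression ratio for fidelity
state = powerSGD.PowerSGDState(
    process_group=None,            # default group spanning all 8 ranks
    matrix_approximation_rank=2,   # illustrative value, needs tuning
    start_powerSGD_iter=1000,      # plain all-reduce for the first 1,000 steps
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)
```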

Code Reference

The complete implementation is available in code/case-study-code.py.