Case Study 1: Scaling Training with DDP
Overview
A computer vision team needs to train a ViT-Large model on an internal dataset of 2 million images for 50 epochs. Single-GPU training takes 12 days on an A100. The goal is to reduce training time to under 2 days using data parallelism while maintaining model accuracy within 0.3% of the single-GPU baseline.
Problem Statement
The team faces several practical challenges:
- Training time: 12 days is too slow for iterative experimentation.
- Batch size sensitivity: increasing the batch size without a corresponding learning rate adjustment degrades ViT-Large's accuracy.
- Infrastructure: The team has access to a cluster with 8 A100 GPUs across 2 nodes (4 GPUs per node).
Approach
Step 1: Baseline Measurement
Single-GPU baseline on one A100-80GB:
- Batch size: 64
- Learning rate: 1e-4 (AdamW)
- Training throughput: 150 images/second
- Peak accuracy: 87.2%
- Training time: 12.1 days
Step 2: DDP Configuration
The team uses PyTorch DistributedDataParallel with:
- Backend: NCCL
- 8 GPUs across 2 nodes (4 per node, connected by InfiniBand)
- DistributedSampler for data sharding
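The sketch below shows the core of this setup, assuming a torchrun launch (which sets RANK, LOCAL_RANK, and WORLD_SIZE); the function name, dataset handling, and DataLoader arguments are illustrative rather than the team's exact code.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_ddp(model, train_dataset, batch_size_per_gpu=64):
    # NCCL backend for GPU collectives; rank and world size come from torchrun.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    # DistributedSampler shards the dataset so each rank sees a disjoint subset.
    sampler = DistributedSampler(train_dataset, shuffle=True)
    loader = DataLoader(train_dataset, batch_size=batch_size_per_gpu,
                        sampler=sampler, num_workers=8, pin_memory=True)
    return ddp_model, loader, sampler
```

In the training loop, sampler.set_epoch(epoch) should be called at the start of each epoch so that the shuffle order differs between epochs.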
Step 3: Batch Size and Learning Rate Scaling
With 8 GPUs, the effective batch size becomes 64 * 8 = 512. The learning rate is scaled using the linear rule with warmup:
- Base LR: 1e-4 (for batch size 64)
- Scaled LR: 8e-4 (for batch size 512)
- Warmup: 5 epochs (linear warmup from 1e-5 to 8e-4)
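In code, the scaling rule and warmup can be expressed as in the sketch below; the per-epoch LambdaLR granularity and the flat schedule after warmup are assumptions, since the case study does not specify the post-warmup decay.

```python
import torch

def build_optimizer_and_schedule(model, base_lr=1e-4, base_batch=64,
                                 per_gpu_batch=64, world_size=8,
                                 warmup_start_lr=1e-5, warmup_epochs=5):
    global_batch = per_gpu_batch * world_size          # 64 * 8 = 512
    scaled_lr = base_lr * global_batch / base_batch    # linear rule: 8e-4

    optimizer = torch.optim.AdamW(model.parameters(), lr=scaled_lr)

    def lr_factor(epoch):
        # Ramp linearly from warmup_start_lr to scaled_lr over warmup_epochs,
        # then hold scaled_lr (any post-warmup decay is omitted here).
        if epoch < warmup_epochs:
            start = warmup_start_lr / scaled_lr
            return start + (1.0 - start) * epoch / warmup_epochs
        return 1.0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler
```

With this formulation, scheduler.step() is called once per epoch; a finer-grained per-step warmup over the same 5 epochs is also common.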
Step 4: Communication Optimization
- Gradient bucketing: 25MB buckets to overlap communication with computation
- FP16 gradient reduction: Reduces communication volume by 2x
- Process group: Separate groups for intra-node and inter-node communication
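The first two optimizations map directly onto DDP's constructor and communication-hook API. The sketch below (replacing the plain DDP wrap from the earlier setup sketch) uses PyTorch's built-in FP16 compression hook; the separate intra-node and inter-node process groups are omitted, since that wiring is cluster-specific.

```python
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def wrap_with_comm_optimizations(model, local_rank):
    # 25 MB buckets: NCCL can all-reduce earlier buckets while later gradients
    # are still being computed, overlapping communication with the backward pass.
    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank],
                    bucket_cap_mb=25)

    # Compress gradients to FP16 for the all-reduce, then decompress,
    # roughly halving communication volume relative to FP32 gradients.
    ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
    return ddp_model
```

A dedicated process group (created with torch.distributed.new_group) can be passed as the hook's state argument when the reduction should run over a subset of ranks.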
Results
| Configuration | GPUs | Batch Size | LR | Throughput | Time | Accuracy |
|---|---|---|---|---|---|---|
| Baseline | 1 | 64 | 1e-4 | 150 img/s | 12.1 days | 87.2% |
| DDP (no scaling) | 8 | 512 | 1e-4 | 1,080 img/s | 1.7 days | 84.8% |
| DDP + linear LR | 8 | 512 | 8e-4 | 1,080 img/s | 1.7 days | 86.5% |
| DDP + warmup | 8 | 512 | 8e-4 | 1,080 img/s | 1.7 days | 87.0% |
| DDP + grad accum | 8 | 512 (4*128) | 8e-4 | 1,020 img/s | 1.8 days | 87.1% |
Scaling efficiency: 1,080 / (150 * 8) = 90%
Key Lessons
- Linear learning rate scaling is essential. Without adjusting the learning rate, accuracy dropped by 2.4 percentage points. The linear scaling rule with warmup recovered nearly all of the accuracy gap.
- Warmup is critical for large-batch training. The first few epochs at a high learning rate caused training instability. A 5-epoch warmup stabilized training and improved final accuracy by 0.5 percentage points.
- 90% scaling efficiency is achievable with NCCL. The 10% overhead comes from gradient synchronization (7%) and data loading (3%).
- Gradient accumulation provides a small accuracy benefit. Using 4 micro-batches of 128 instead of one batch of 512 slightly improved accuracy, likely due to more frequent parameter updates (see the sketch after this list).
- Inter-node communication is the bottleneck. Profiling showed that 65% of communication time was inter-node (InfiniBand) while 35% was intra-node (NVLink). Gradient compression could further improve inter-node efficiency.
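A sketch of the accumulation pattern from lesson 4 is shown below, using DDP's no_sync() context manager so that the gradient all-reduce runs only on the last micro-batch of each group; the function name, loss, and loop structure are illustrative rather than the team's exact code.

```python
import contextlib
import torch.nn.functional as F

def train_one_epoch_with_accumulation(ddp_model, loader, optimizer, device,
                                      accum_steps=4):
    # 4 micro-batches of 128 -> one optimizer step per effective batch of 512.
    ddp_model.train()
    optimizer.zero_grad()
    for step, (images, labels) in enumerate(loader):
        images, labels = images.to(device), labels.to(device)
        last_in_group = (step + 1) % accum_steps == 0

        # no_sync() skips the gradient all-reduce on non-final micro-batches,
        # so communication happens once per effective batch instead of four times.
        ctx = contextlib.nullcontext() if last_in_group else ddp_model.no_sync()
        with ctx:
            loss = F.cross_entropy(ddp_model(images), labels) / accum_steps
            loss.backward()

        if last_in_group:
            optimizer.step()
            optimizer.zero_grad()
```

Dividing the loss by accum_steps keeps the accumulated gradient equivalent to averaging over the full 512-sample batch.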
Code Reference
The complete implementation is available in code/case-study-code.py.