Case Study 1: Scaling Training with DDP
Overview
A computer vision team needs to train a ViT-Large model on an internal dataset of 2 million images for 50 epochs. Single-GPU training takes 12 days on an A100. The goal is to reduce training time to under 2 days using data parallelism while maintaining model accuracy within 0.3% of the single-GPU baseline.
Problem Statement
The team faces several practical challenges:
- Training time: 12 days is too slow for iterative experimentation.
- Batch size sensitivity: increasing the batch size without a corresponding learning rate adjustment degrades ViT-Large's accuracy.
- Infrastructure: The team has access to a cluster with 8 A100 GPUs across 2 nodes (4 GPUs per node).
Approach
Step 1: Baseline Measurement
Single-GPU baseline on one A100-80GB:
- Batch size: 64
- Learning rate: 1e-4 (AdamW)
- Training throughput: 150 images/second
- Peak accuracy: 87.2%
- Training time: 12.1 days
Step 2: DDP Configuration
The team uses PyTorch DistributedDataParallel with:
- Backend: NCCL
- 8 GPUs across 2 nodes (4 per node, connected by InfiniBand)
- DistributedSampler for data sharding
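The sketch below shows the core of this setup, assuming a torchrun launch (which sets RANK, LOCAL_RANK, and WORLD_SIZE); the function name, dataset handling, and DataLoader arguments are illustrative rather than the team's exact code.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_ddp(model, train_dataset, batch_size_per_gpu=64):
    # NCCL backend for GPU collectives; rank and world size come from torchrun.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    # DistributedSampler shards the dataset so each rank sees a disjoint subset.
    sampler = DistributedSampler(train_dataset, shuffle=True)
    loader = DataLoader(train_dataset, batch_size=batch_size_per_gpu,
                        sampler=sampler, num_workers=8, pin_memory=True)
    return ddp_model, loader, sampler
```

In the training loop, sampler.set_epoch(epoch) should be called at the start of each epoch so that the shuffle order differs between epochs.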
Step 3: Batch Size and Learning Rate Scaling
With 8 GPUs, the effective batch size becomes 64 * 8 = 512. The learning rate is scaled using the linear rule with warmup:
- Base LR: 1e-4 (for batch size 64)
- Scaled LR: 8e-4 (for batch size 512)
- Warmup: 5 epochs (linear warmup from 1e-5 to 8e-4)
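In code, the scaling rule and warmup can be expressed as in the sketch below; the per-epoch LambdaLR granularity and the flat schedule after warmup are assumptions, since the case study does not specify the post-warmup decay.

```python
import torch

def build_optimizer_and_schedule(model, base_lr=1e-4, base_batch=64,
                                 per_gpu_batch=64, world_size=8,
                                 warmup_start_lr=1e-5, warmup_epochs=5):
    global_batch = per_gpu_batch * world_size          # 64 * 8 = 512
    scaled_lr = base_lr * global_batch / base_batch    # linear rule: 8e-4

    optimizer = torch.optim.AdamW(model.parameters(), lr=scaled_lr)

    def lr_factor(epoch):
        # Ramp linearly from warmup_start_lr to scaled_lr over warmup_epochs,
        # then hold scaled_lr (any post-warmup decay is omitted here).
        if epoch < warmup_epochs:
            start = warmup_start_lr / scaled_lr
            return start + (1.0 - start) * epoch / warmup_epochs
        return 1.0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler
```

With this formulation, scheduler.step() is called once per epoch; a finer-grained per-step warmup over the same 5 epochs is also common.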
Step 4: Communication Optimization
- Gradient bucketing: 25MB buckets to overlap communication with computation
- FP16 gradient reduction: Reduces communication volume by 2x
- Process group: Separate groups for intra-node and inter-node communication
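The first two optimizations map directly onto DDP's constructor and communication-hook API. The sketch below (replacing the plain DDP wrap from the earlier setup sketch) uses PyTorch's built-in FP16 compression hook; the separate intra-node and inter-node process groups are omitted, since that wiring is cluster-specific.

```python
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def wrap_with_comm_optimizations(model, local_rank):
    # 25 MB buckets: NCCL can all-reduce earlier buckets while later gradients
    # are still being computed, overlapping communication with the backward pass.
    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank],
                    bucket_cap_mb=25)

    # Compress gradients to FP16 for the all-reduce, then decompress,
    # roughly halving communication volume relative to FP32 gradients.
    ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
    return ddp_model
```

A dedicated process group (created with torch.distributed.new_group) can be passed as the hook's state argument when the reduction should run over a subset of ranks.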
Results
| Configuration | GPUs | Batch Size | LR | Throughput | Time | Accuracy |
|---|---|---|---|---|---|---|
| Baseline | 1 | 64 | 1e-4 | 150 img/s | 12.1 days | 87.2% |
| DDP (no scaling) | 8 | 512 | 1e-4 | 1,080 img/s | 1.7 days | 84.8% |
| DDP + linear LR | 8 | 512 | 8e-4 | 1,080 img/s | 1.7 days | 86.5% |
| DDP + warmup | 8 | 512 | 8e-4 | 1,080 img/s | 1.7 days | 87.0% |
| DDP + grad accum | 8 | 512 (4*128) | 8e-4 | 1,020 img/s | 1.8 days | 87.1% |
Scaling efficiency: 1,080 / (150 * 8) = 90%
Key Lessons
- Linear learning rate scaling is essential. Without adjusting the learning rate, accuracy dropped by 2.4 percentage points. The linear scaling rule with warmup recovered nearly all of the accuracy gap.
- Warmup is critical for large-batch training. The first few epochs at a high learning rate caused training instability. A 5-epoch warmup stabilized training and improved final accuracy by 0.5 percentage points.
- 90% scaling efficiency is achievable with NCCL. The 10% overhead comes from gradient synchronization (7%) and data loading (3%).
- Gradient accumulation provides a small accuracy benefit. Using 4 micro-batches of 128 instead of one batch of 512 slightly improved accuracy, likely due to more frequent parameter updates (see the sketch after this list).
- Inter-node communication is the bottleneck. Profiling showed that 65% of communication time was inter-node (InfiniBand) while 35% was intra-node (NVLink). Gradient compression could further improve inter-node efficiency.
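A sketch of the accumulation pattern from lesson 4 is shown below, using DDP's no_sync() context manager so that the gradient all-reduce runs only on the last micro-batch of each group; the function name, loss, and loop structure are illustrative rather than the team's exact code.

```python
import contextlib
import torch.nn.functional as F

def train_one_epoch_with_accumulation(ddp_model, loader, optimizer, device,
                                      accum_steps=4):
    # 4 micro-batches of 128 -> one optimizer step per effective batch of 512.
    ddp_model.train()
    optimizer.zero_grad()
    for step, (images, labels) in enumerate(loader):
        images, labels = images.to(device), labels.to(device)
        last_in_group = (step + 1) % accum_steps == 0

        # no_sync() skips the gradient all-reduce on non-final micro-batches,
        # so communication happens once per effective batch instead of four times.
        ctx = contextlib.nullcontext() if last_in_group else ddp_model.no_sync()
        with ctx:
            loss = F.cross_entropy(ddp_model(images), labels) / accum_steps
            loss.backward()

        if last_in_group:
            optimizer.step()
            optimizer.zero_grad()
```

Dividing the loss by accum_steps keeps the accumulated gradient equivalent to averaging over the full 512-sample batch.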
Code Reference
The complete implementation is available in code/case-study-code.py.