Case Study 2: Training with FSDP and DeepSpeed
Overview
A research team needs to fine-tune a 13B parameter language model on domain-specific data. The model requires approximately 52 GB in float32 for parameters alone, far exceeding single-GPU memory. The team has access to 4 A100-80GB GPUs on a single node and must choose between FSDP and DeepSpeed for memory-efficient training.
Problem Statement
Training a 13B model with AdamW requires:

- Parameters: 13B * 4 bytes = 52 GB
- Gradients: 52 GB
- Adam states: 104 GB (first and second moments, each the size of the parameters)
- Total (excluding activations): 208 GB
With activation memory and framework overhead included, this exceeds the combined 320 GB (4 * 80 GB) across the node's A100-80GB GPUs, and no single GPU can hold the full 208 GB of model state on its own, so parameters, gradients, and optimizer states must be sharded across devices.
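The arithmetic behind these numbers is simple enough to sanity-check in a few lines (a back-of-the-envelope sketch, not part of the training code):

```python
# Back-of-the-envelope memory estimate for a 13B-parameter model with AdamW in FP32.
PARAMS = 13e9
BYTES_FP32 = 4

params_gb = PARAMS * BYTES_FP32 / 1e9      # 52 GB of parameters
grads_gb = params_gb                       # 52 GB of gradients
adam_gb = 2 * params_gb                    # 104 GB: first and second moments
total_gb = params_gb + grads_gb + adam_gb  # 208 GB before activations

print(f"params={params_gb:.0f} GB, grads={grads_gb:.0f} GB, "
      f"adam={adam_gb:.0f} GB, total={total_gb:.0f} GB")
```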
Approach
Configuration 1: FSDP
PyTorch FSDP with full sharding (equivalent to ZeRO Stage 3):

- Shard parameters, gradients, and optimizer states across the 4 GPUs
- Mixed precision: compute in BF16, reduce in FP32, store in BF16
- Activation checkpointing on every transformer block
- Per-block FSDP wrapping
Memory per GPU (FSDP):

- Sharded parameters: 52 GB / 4 = 13 GB (stored in BF16: 6.5 GB)
- Sharded gradients: 13 GB (in BF16: 6.5 GB)
- Sharded optimizer states: 26 GB (in FP32)
- Activations (with checkpointing): ~8 GB
- Total: ~47 GB per GPU (fits in 80 GB)
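A minimal sketch of this FSDP setup is shown below. It assumes a hypothetical `build_model()` helper and `TransformerBlock` class standing in for the actual 13B model; the sharding strategy, mixed-precision policy, and per-block wrapping follow the configuration described above.

```python
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Hypothetical stand-ins for the real 13B model and its transformer block class.
from model import TransformerBlock, build_model

# Assumes launch via torchrun, which sets RANK/LOCAL_RANK/WORLD_SIZE.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = build_model()

# BF16 compute, FP32 gradient reduction, BF16 buffers, as described above.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    buffer_dtype=torch.bfloat16,
)

# Wrap each transformer block as its own FSDP unit (per-block wrapping).
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3-style full sharding
    mixed_precision=mp_policy,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)
```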
Configuration 2: DeepSpeed ZeRO Stage 3
DeepSpeed with ZeRO-3 and CPU offloading:

- ZeRO Stage 3 sharding of parameters, gradients, and optimizer states
- Optimizer state offloading to CPU
- Mixed precision with FP16
Memory per GPU (DeepSpeed + offload):

- Sharded parameters: 6.5 GB (FP16)
- Sharded gradients: 6.5 GB (FP16)
- Optimizer states: offloaded to CPU RAM
- Activations: ~8 GB
- Total GPU memory: ~21 GB per GPU
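A corresponding DeepSpeed sketch, assuming the same hypothetical `build_model()` helper as in the FSDP sketch, might look like the following; the config dict mirrors the settings described above and in the table below (ZeRO-3, CPU optimizer offload, FP16), with exact values to be tuned for a real run.

```python
import deepspeed

from model import build_model  # hypothetical helper, as in the FSDP sketch

model = build_model()

# ZeRO-3 with CPU optimizer offload and FP16; values mirror the training table below.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "fp16": {"enabled": True},        # FP16 with dynamic loss scaling
    "zero_optimization": {
        "stage": 3,                   # shard params, grads, optimizer states
        "offload_optimizer": {        # keep optimizer states in CPU RAM
            "device": "cpu",
            "pin_memory": True,
        },
    },
}

# DeepSpeed builds the distributed engine and (CPU-offloaded) optimizer itself.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```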
Training Configuration
| Setting | FSDP | DeepSpeed |
|---|---|---|
| Precision | BF16 compute, FP32 reduce | FP16 with loss scaling |
| Batch size (per GPU) | 2 | 4 |
| Gradient accumulation | 4 steps | 2 steps |
| Effective batch size | 32 | 32 |
| Learning rate | 2e-5 | 2e-5 |
| Warmup steps | 100 | 100 |
| Activation checkpointing | Yes | Yes |
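Both configurations target the same effective batch size; the relationship is worth spelling out, since it is what keeps the two runs comparable:

```python
# effective batch = per-GPU micro-batch * gradient accumulation steps * number of GPUs
NUM_GPUS = 4

fsdp_effective = 2 * 4 * NUM_GPUS       # 2 per GPU, 4 accumulation steps -> 32
deepspeed_effective = 4 * 2 * NUM_GPUS  # 4 per GPU, 2 accumulation steps -> 32
assert fsdp_effective == deepspeed_effective == 32
```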
Results
| Metric | FSDP | DeepSpeed ZeRO-3 | DeepSpeed + Offload |
|---|---|---|---|
| GPU memory per device | 47 GB | 44 GB | 21 GB |
| Training throughput | 820 tokens/s | 790 tokens/s | 420 tokens/s |
| Time per epoch (10K steps) | 3.4 hours | 3.5 hours | 6.6 hours |
| Validation loss | 1.82 | 1.83 | 1.83 |
| Setup complexity | Medium | Medium-High | High |
Throughput Breakdown
| Component | FSDP (% of step time) | DeepSpeed ZeRO-3 (% of step time) |
|---|---|---|
| Forward pass | 35% | 34% |
| Backward pass | 40% | 39% |
| All-gather (params) | 12% | 13% |
| Reduce-scatter (grads) | 8% | 9% |
| Optimizer step | 5% | 5% |
Key Lessons
- FSDP and DeepSpeed ZeRO-3 achieve comparable performance. Without CPU offloading, both solutions provide similar throughput and memory efficiency. The choice depends on ecosystem preferences (PyTorch-native FSDP vs the DeepSpeed library).
- CPU offloading trades throughput for memory. DeepSpeed with offloading reduced GPU memory by 55% but halved training throughput. This is worthwhile when GPU memory is the binding constraint (e.g., larger models or fewer GPUs).
- Activation checkpointing is essential for large models. Without it, activation memory exceeded GPU capacity even with parameter sharding. Checkpointing added ~30% compute overhead but made training feasible (a minimal sketch follows this list).
- BF16 is preferred over FP16 for large models. BF16's larger dynamic range eliminated the need for loss scaling and gradient overflow handling, simplifying the training loop.
- Per-block FSDP wrapping optimizes the memory-communication trade-off. Wrapping at the transformer block level (rather than per-layer or whole-model) provided the best balance: sufficient sharding for memory savings without excessive all-gather communication.
- Effective batch size should be kept constant across configurations. When comparing FSDP and DeepSpeed, keeping the effective batch size at 32 (via different per-GPU batch sizes and gradient accumulation steps) ensured comparable training dynamics and final loss.
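For the activation-checkpointing lesson above, one way to apply per-block checkpointing to the FSDP-wrapped model in recent PyTorch versions is sketched below, again assuming the hypothetical `TransformerBlock` class from the earlier FSDP sketch.

```python
import functools

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

# Hypothetical block class; use the actual transformer block type of the 13B model.
from model import TransformerBlock

non_reentrant_wrapper = functools.partial(
    checkpoint_wrapper,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)

# Recompute each transformer block's activations during the backward pass
# instead of keeping them in memory; `model` is the FSDP-wrapped model above.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=lambda module: isinstance(module, TransformerBlock),
)
```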
Code Reference
The complete implementation is available in code/case-study-code.py.