Chapter 35: Further Reading

Data Parallelism

  • Li, S., Zhao, Y., Varma, R., et al. (2020). "PyTorch Distributed: Experiences on Accelerating Data Parallel Training." Proceedings of the VLDB Endowment, 13(12), 3005--3018. The official PyTorch DDP paper describing the design, optimizations (gradient bucketing, communication overlap), and performance of DistributedDataParallel.

  • Goyal, P., Dollár, P., Girshick, R., et al. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv preprint arXiv:1706.02677. The foundational paper on the linear learning rate scaling rule and gradual warmup for large-batch distributed training; a brief sketch of the rule follows this list.

  • You, Y., Gitman, I., & Ginsburg, B. (2017). "Large Batch Training of Convolutional Networks." arXiv preprint arXiv:1708.03888. Introduces LARS (Layer-wise Adaptive Rate Scaling), addressing training instability at very large batch sizes.
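
As a concrete companion to Goyal et al., here is a minimal sketch of the linear scaling rule and gradual warmup. The helper names, batch sizes, and warmup length are illustrative placeholders rather than values fixed by the paper (which uses a base rate of 0.1 at batch size 256 and a five-epoch warmup).

    def scaled_lr(base_lr, base_batch, global_batch):
        """Linear scaling rule: the learning rate grows with the global batch size."""
        return base_lr * (global_batch / base_batch)

    def warmup_lr(target_lr, step, warmup_steps):
        """Gradual warmup: ramp linearly up to target_lr, then hold it."""
        if step < warmup_steps:
            return target_lr * (step + 1) / warmup_steps
        return target_lr

    # Base rate 0.1 at batch 256 (as in the paper), scaled to a global batch of 8192.
    target = scaled_lr(0.1, 256, 8192)                     # 3.2
    print(warmup_lr(target, step=0, warmup_steps=500))     # small rate early in warmup
    print(warmup_lr(target, step=500, warmup_steps=500))   # full scaled rate afterwards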

FSDP and ZeRO

  • Zhao, Y., Gu, A., Varma, R., et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." Proceedings of the VLDB Endowment, 16(12), 3848--3860. The official FSDP paper describing the PyTorch implementation, wrapping strategies, and performance at scale.

  • Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. The foundational ZeRO paper introducing the three stages of optimizer state, gradient, and parameter sharding; a back-of-the-envelope memory sketch follows this list.

  • Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., & He, Y. (2021). "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning." SC21. Extends ZeRO with offloading to CPU and NVMe memory, enabling training of models with trillions of parameters.
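
To make the three ZeRO stages concrete, the illustrative helper below estimates per-GPU model-state memory using the paper's accounting for mixed-precision Adam: 2 bytes per parameter of fp16 weights, 2 of fp16 gradients, and 12 of fp32 optimizer state (master weights, momentum, variance). Activations, buffers, and fragmentation are ignored.

    def zero_model_state_gb(num_params, num_gpus, stage):
        """Approximate per-GPU model-state memory (GB) under each ZeRO stage."""
        p, g, o = 2.0, 2.0, 12.0           # bytes per parameter
        if stage == 0:                     # plain data parallelism: fully replicated
            per_param = p + g + o
        elif stage == 1:                   # shard optimizer states
            per_param = p + g + o / num_gpus
        elif stage == 2:                   # also shard gradients
            per_param = p + (g + o) / num_gpus
        else:                              # stage 3: also shard parameters
            per_param = (p + g + o) / num_gpus
        return num_params * per_param / 1e9

    # The ZeRO paper's running example: a 7.5B-parameter model on 64 GPUs.
    for s in range(4):
        print(f"stage {s}: {zero_model_state_gb(7.5e9, 64, s):.1f} GB per GPU")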

Model Parallelism

  • Narayanan, D., Shoeybi, M., Casper, J., et al. (2021). "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM." SC21. Describes the 3D parallelism approach (data + tensor + pipeline) used to train models up to 1 trillion parameters.

  • Huang, Y., Cheng, Y., Bapna, A., et al. (2019). "GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism." NeurIPS 2019. Introduces micro-batching for pipeline parallelism, reducing the pipeline bubble and enabling efficient training.

  • Shoeybi, M., Patwary, M., Puri, R., et al. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv preprint arXiv:1909.08053. Details tensor parallelism for transformer models, with column-parallel and row-parallel linear layer partitioning; a toy demonstration of the two partitionings follows this list.
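
The column- and row-parallel partitionings described by Shoeybi et al. can be checked on a single CPU. The toy sketch below splits a linear layer's weight both ways and verifies that the sharded computation reproduces the unsharded result; in a real implementation the per-rank slices live on different GPUs and the row-parallel sum is performed by an all-reduce.

    import torch

    torch.manual_seed(0)
    X = torch.randn(4, 8)        # activations: (batch, hidden)
    A = torch.randn(8, 16)       # weight of a linear layer Y = X @ A
    Y_ref = X @ A

    # Column parallel: each "rank" holds a slice of A's columns and produces a slice of Y.
    A1, A2 = A[:, :8], A[:, 8:]
    Y_col = torch.cat([X @ A1, X @ A2], dim=-1)

    # Row parallel: each "rank" holds a slice of A's rows and the matching slice of X;
    # the partial products are summed (the all-reduce step in a real implementation).
    X1, X2 = X[:, :4], X[:, 4:]
    B1, B2 = A[:4, :], A[4:, :]
    Y_row = X1 @ B1 + X2 @ B2

    print(torch.allclose(Y_ref, Y_col), torch.allclose(Y_ref, Y_row))  # True True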

Mixed Precision Training

  • Micikevicius, P., Narang, S., Alben, J., et al. (2018). "Mixed Precision Training." ICLR 2018. The foundational paper on training with float16 using loss scaling to prevent gradient underflow; a sketch of the loss-scaling loop follows this list.

  • Kalamkar, D., Mudigere, D., Mellempudi, N., et al. (2019). "A Study of BFLOAT16 for Deep Learning Training." arXiv preprint arXiv:1905.12322. Demonstrates that BFloat16 matches Float32 training quality without loss scaling due to its larger dynamic range.
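
A minimal sketch of the loss-scaling recipe from Micikevicius et al., expressed with PyTorch's automatic mixed precision utilities. It assumes a CUDA device is available, and the tiny model and synthetic data exist purely for illustration; with BFloat16 (Kalamkar et al.) the GradScaler is typically unnecessary.

    import torch

    device = "cuda"                                   # fp16 loss scaling targets GPUs
    model = torch.nn.Linear(32, 4).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()              # maintains the dynamic loss scale

    for _ in range(10):
        x = torch.randn(16, 32, device=device)
        y = torch.randint(0, 4, (16,), device=device)
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()                 # backward pass on the scaled loss
        scaler.step(optimizer)                        # unscales grads; skips step on inf/nan
        scaler.update()                               # adjusts the scale factor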

Communication and Systems

  • Chan, E., Heimlich, M., Purkayastha, A., & van de Geijn, R. (2007). "Collective Communication: Theory, Practice, and Experience." Concurrency and Computation: Practice and Experience, 19(13), 1749--1783. Comprehensive analysis of collective communication algorithms, including the ring-based approaches sketched after this list.

  • NVIDIA. (2024). "NCCL Documentation." Available at: https://docs.nvidia.com/deeplearning/nccl. Official documentation for NVIDIA Collective Communications Library, the standard backend for GPU distributed training.
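
The ring all-reduce analyzed by Chan et al. (and among the algorithms NCCL implements) performs a reduce-scatter followed by an all-gather, each moving (p - 1)/p of the buffer per rank. The illustrative helper below computes that per-rank traffic, ignoring latency terms.

    def ring_allreduce_bytes_per_rank(buffer_bytes, num_ranks):
        """Approximate bytes each rank sends during one ring all-reduce."""
        return 2 * (num_ranks - 1) / num_ranks * buffer_bytes

    # Example: all-reducing fp16 gradients of a 1B-parameter model across 8 GPUs.
    grad_bytes = 1e9 * 2
    print(f"{ring_allreduce_bytes_per_rank(grad_bytes, 8) / 1e9:.2f} GB per rank")
    # ~3.50 GB, nearly independent of the number of ranks (bandwidth-optimal).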

Practical Guides

  • Hugging Face. (2024). "Accelerate Documentation." Available at: https://huggingface.co/docs/accelerate. Documentation for Hugging Face Accelerate, which simplifies distributed training with DDP, FSDP, and DeepSpeed; a minimal training-loop skeleton follows this list.

  • Microsoft. (2024). "DeepSpeed Documentation." Available at: https://www.deepspeed.ai/docs. Comprehensive documentation for DeepSpeed including ZeRO configurations, offloading, and inference optimization.

  • Weng, L. (2023). "Large Transformer Model Inference Optimization." Lil'Log blog. An excellent overview of optimization techniques for efficient inference with large transformer models.
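
Finally, a minimal training-loop skeleton in the style of the Hugging Face Accelerate documentation; the toy model and data are placeholders. Launched with the accelerate launch command, the same script can run under DDP, FSDP, or DeepSpeed depending on the chosen configuration.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    accelerator = Accelerator()                      # reads the launch configuration
    model = torch.nn.Linear(32, 4)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    data = TensorDataset(torch.randn(256, 32), torch.randint(0, 4, (256,)))
    loader = DataLoader(data, batch_size=16, shuffle=True)

    # prepare() wraps the objects for the selected backend and moves them to device.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        accelerator.backward(loss)                   # replaces loss.backward()
        optimizer.step()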