Chapter 35: Further Reading

Data Parallelism

  • Li, S., Zhao, Y., Varma, R., et al. (2020). "PyTorch Distributed: Experiences on Accelerating Data Parallel Training." Proceedings of the VLDB Endowment, 13(12), 3005--3018. The official PyTorch DDP paper describing the design, optimizations (gradient bucketing, communication overlap), and performance of DistributedDataParallel.

  • Goyal, P., Dollár, P., Girshick, R., et al. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv preprint arXiv:1706.02677. The foundational paper on the linear learning rate scaling rule and gradual warmup for large-batch distributed training; a brief sketch of the rule follows this list.

  • You, Y., Gitman, I., & Ginsburg, B. (2017). "Large Batch Training of Convolutional Networks." arXiv preprint arXiv:1708.03888. Introduces LARS (Layer-wise Adaptive Rate Scaling), addressing training instability at very large batch sizes.
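
As a concrete companion to Goyal et al., here is a minimal sketch of the linear scaling rule and gradual warmup. The helper names, batch sizes, and warmup length are illustrative placeholders rather than values fixed by the paper (which uses a base rate of 0.1 at batch size 256 and a five-epoch warmup).

    def scaled_lr(base_lr, base_batch, global_batch):
        """Linear scaling rule: the learning rate grows with the global batch size."""
        return base_lr * (global_batch / base_batch)

    def warmup_lr(target_lr, step, warmup_steps):
        """Gradual warmup: ramp linearly up to target_lr, then hold it."""
        if step < warmup_steps:
            return target_lr * (step + 1) / warmup_steps
        return target_lr

    # Base rate 0.1 at batch 256 (as in the paper), scaled to a global batch of 8192.
    target = scaled_lr(0.1, 256, 8192)                     # 3.2
    print(warmup_lr(target, step=0, warmup_steps=500))     # small rate early in warmup
    print(warmup_lr(target, step=500, warmup_steps=500))   # full scaled rate afterwards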

FSDP and ZeRO

  • Zhao, Y., Gu, A., Varma, R., et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." Proceedings of the VLDB Endowment, 16(12), 3848--3860. The official FSDP paper describing the PyTorch implementation, wrapping strategies, and performance at scale.

  • Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. The foundational ZeRO paper introducing the three stages of optimizer state, gradient, and parameter sharding; a back-of-the-envelope memory sketch follows this list.

  • Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., & He, Y. (2021). "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning." SC21. Extends ZeRO with offloading to CPU and NVMe memory, enabling training of models with trillions of parameters.
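
To make the three ZeRO stages concrete, the illustrative helper below estimates per-GPU model-state memory using the paper's accounting for mixed-precision Adam: 2 bytes per parameter of fp16 weights, 2 of fp16 gradients, and 12 of fp32 optimizer state (master weights, momentum, variance). Activations, buffers, and fragmentation are ignored.

    def zero_model_state_gb(num_params, num_gpus, stage):
        """Approximate per-GPU model-state memory (GB) under each ZeRO stage."""
        p, g, o = 2.0, 2.0, 12.0           # bytes per parameter
        if stage == 0:                     # plain data parallelism: fully replicated
            per_param = p + g + o
        elif stage == 1:                   # shard optimizer states
            per_param = p + g + o / num_gpus
        elif stage == 2:                   # also shard gradients
            per_param = p + (g + o) / num_gpus
        else:                              # stage 3: also shard parameters
            per_param = (p + g + o) / num_gpus
        return num_params * per_param / 1e9

    # The ZeRO paper's running example: a 7.5B-parameter model on 64 GPUs.
    for s in range(4):
        print(f"stage {s}: {zero_model_state_gb(7.5e9, 64, s):.1f} GB per GPU")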

Model Parallelism

  • Narayanan, D., Shoeybi, M., Casper, J., et al. (2021). "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM." SC21. Describes the 3D parallelism approach (data + tensor + pipeline) used to train models up to 1 trillion parameters.

  • Huang, Y., Cheng, Y., Bapna, A., et al. (2019). "GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism." NeurIPS 2019. Introduces micro-batching for pipeline parallelism, reducing the pipeline bubble and enabling efficient training.

  • Shoeybi, M., Patwary, M., Puri, R., et al. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv preprint arXiv:1909.08053. Details tensor parallelism for transformer models, with column-parallel and row-parallel linear layer partitioning; a toy demonstration of the two partitionings follows this list.
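
The column- and row-parallel partitionings described by Shoeybi et al. can be checked on a single CPU. The toy sketch below splits a linear layer's weight both ways and verifies that the sharded computation reproduces the unsharded result; in a real implementation the per-rank slices live on different GPUs and the row-parallel sum is performed by an all-reduce.

    import torch

    torch.manual_seed(0)
    X = torch.randn(4, 8)        # activations: (batch, hidden)
    A = torch.randn(8, 16)       # weight of a linear layer Y = X @ A
    Y_ref = X @ A

    # Column parallel: each "rank" holds a slice of A's columns and produces a slice of Y.
    A1, A2 = A[:, :8], A[:, 8:]
    Y_col = torch.cat([X @ A1, X @ A2], dim=-1)

    # Row parallel: each "rank" holds a slice of A's rows and the matching slice of X;
    # the partial products are summed (the all-reduce step in a real implementation).
    X1, X2 = X[:, :4], X[:, 4:]
    B1, B2 = A[:4, :], A[4:, :]
    Y_row = X1 @ B1 + X2 @ B2

    print(torch.allclose(Y_ref, Y_col), torch.allclose(Y_ref, Y_row))  # True True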

Mixed Precision Training

  • Micikevicius, P., Narang, S., Alben, J., et al. (2018). "Mixed Precision Training." ICLR 2018. The foundational paper on training with float16 using loss scaling to prevent gradient underflow; a sketch of the loss-scaling loop follows this list.

  • Kalamkar, D., Mudigere, D., Mellempudi, N., et al. (2019). "A Study of BFLOAT16 for Deep Learning Training." arXiv preprint arXiv:1905.12322. Demonstrates that BFloat16 matches Float32 training quality without loss scaling due to its larger dynamic range.
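
A minimal sketch of the loss-scaling recipe from Micikevicius et al., expressed with PyTorch's automatic mixed precision utilities. It assumes a CUDA device is available, and the tiny model and synthetic data exist purely for illustration; with BFloat16 (Kalamkar et al.) the GradScaler is typically unnecessary.

    import torch

    device = "cuda"                                   # fp16 loss scaling targets GPUs
    model = torch.nn.Linear(32, 4).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()              # maintains the dynamic loss scale

    for _ in range(10):
        x = torch.randn(16, 32, device=device)
        y = torch.randint(0, 4, (16,), device=device)
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()                 # backward pass on the scaled loss
        scaler.step(optimizer)                        # unscales grads; skips step on inf/nan
        scaler.update()                               # adjusts the scale factor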

Communication and Systems

  • Chan, E., Heimlich, M., Purkayastha, A., & van de Geijn, R. (2007). "Collective Communication: Theory, Practice, and Experience." Concurrency and Computation: Practice and Experience, 19(13), 1749--1783. Comprehensive analysis of collective communication algorithms, including the ring-based approaches sketched after this list.

  • NVIDIA. (2024). "NCCL Documentation." Available at: https://docs.nvidia.com/deeplearning/nccl. Official documentation for NVIDIA Collective Communications Library, the standard backend for GPU distributed training.
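
The ring all-reduce analyzed by Chan et al. (and among the algorithms NCCL implements) performs a reduce-scatter followed by an all-gather, each moving (p - 1)/p of the buffer per rank. The illustrative helper below computes that per-rank traffic, ignoring latency terms.

    def ring_allreduce_bytes_per_rank(buffer_bytes, num_ranks):
        """Approximate bytes each rank sends during one ring all-reduce."""
        return 2 * (num_ranks - 1) / num_ranks * buffer_bytes

    # Example: all-reducing fp16 gradients of a 1B-parameter model across 8 GPUs.
    grad_bytes = 1e9 * 2
    print(f"{ring_allreduce_bytes_per_rank(grad_bytes, 8) / 1e9:.2f} GB per rank")
    # ~3.50 GB, nearly independent of the number of ranks (bandwidth-optimal).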

Practical Guides

  • Hugging Face. (2024). "Accelerate Documentation." Available at: https://huggingface.co/docs/accelerate. Documentation for Hugging Face Accelerate, which simplifies distributed training with DDP, FSDP, and DeepSpeed; a minimal training-loop skeleton follows this list.

  • Microsoft. (2024). "DeepSpeed Documentation." Available at: https://www.deepspeed.ai/docs. Comprehensive documentation for DeepSpeed including ZeRO configurations, offloading, and inference optimization.

  • Weng, L. (2023). "Large Transformer Model Inference Optimization." Lil'Log blog. An excellent overview of optimization techniques for efficient inference with large transformer models.
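
Finally, a minimal training-loop skeleton in the style of the Hugging Face Accelerate documentation; the toy model and data are placeholders. Launched with the accelerate launch command, the same script can run under DDP, FSDP, or DeepSpeed depending on the chosen configuration.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    accelerator = Accelerator()                      # reads the launch configuration
    model = torch.nn.Linear(32, 4)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    data = TensorDataset(torch.randn(256, 32), torch.randint(0, 4, (256,)))
    loader = DataLoader(data, batch_size=16, shuffle=True)

    # prepare() wraps the objects for the selected backend and moves them to device.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        accelerator.backward(loss)                   # replaces loss.backward()
        optimizer.step()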