Chapter 33: Further Reading

Foundational Papers

Quantization

  • Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers." ICLR 2023. The paper that made INT4 quantization practical for large language models. Introduces a layer-wise quantization approach based on approximate second-order information that achieves near-lossless INT4 quantization. https://arxiv.org/abs/2210.17323

  • Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. (2023). "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024. Proposes protecting the small fraction of weight channels that matter most to activations by scaling them per channel before quantization, achieving comparable or better quality than GPTQ with significantly faster quantization. https://arxiv.org/abs/2306.00978

  • Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS 2023. Introduces 4-bit NormalFloat (NF4) quantization and demonstrates that fine-tuning a quantized model with LoRA adapters matches full 16-bit fine-tuning quality. Essential reading for combining quantization with fine-tuning; a minimal NF4 loading sketch appears after this list. https://arxiv.org/abs/2305.14314

  • Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. Identifies the "emergent outlier" phenomenon in transformer activations and proposes mixed-precision decomposition (INT8 for most features, FP16 for outlier features) for zero-degradation 8-bit inference. https://arxiv.org/abs/2208.07339
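
A concrete entry point to these methods is loading a model with QLoRA-style 4-bit NF4 weights through the transformers and bitsandbytes libraries. The sketch below is a minimal example, assuming both packages are installed and a GPU is available; the model name is a placeholder.

    # Minimal sketch: 4-bit NF4 weight loading (QLoRA-style) via bitsandbytes.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat4 from the QLoRA paper
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # place layers across available GPUs/CPU
    )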

Attention and Memory Optimization

  • Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. The foundational paper for vLLM. Applies OS-style virtual memory management to the KV cache, eliminating the 60-80% of cache memory that prior systems wasted to fragmentation. A must-read for understanding modern LLM serving. https://arxiv.org/abs/2309.06180

  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. Introduces IO-aware attention computation that tiles the attention operation to fit in GPU SRAM, achieving 2-4x speedup with linear memory scaling. Now standard in all production inference frameworks. https://arxiv.org/abs/2205.14135

  • Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." ICLR 2024. Improves upon FlashAttention with better work partitioning and parallelism, achieving 50-73% of theoretical peak FLOPs on A100 GPUs. https://arxiv.org/abs/2307.08691

  • Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023. Introduces Grouped-Query Attention as a middle ground between MHA and MQA, showing that GQA achieves quality close to MHA with KV cache size close to MQA. https://arxiv.org/abs/2305.13245
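
The KV cache savings that motivate GQA are easy to estimate from the model shape alone. The sketch below uses illustrative, roughly Llama-2-70B-like numbers (not figures from the paper) and counts two cached tensors, keys and values, per layer, per KV head, per position.

    # Back-of-the-envelope KV cache size, illustrating why GQA shrinks the cache.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
        # 2x for keys and values, stored per layer, per KV head, per position
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

    mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096, batch=1)
    gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8,  head_dim=128, seq_len=4096, batch=1)
    print(f"MHA: {mha / 2**30:.1f} GiB, GQA (8 KV heads): {gqa / 2**30:.1f} GiB")
    # MHA: 10.0 GiB, GQA (8 KV heads): 1.2 GiB  (FP16, single sequence)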

Speculative Decoding

  • Leviathan, Y., Kalman, M., and Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. One of two contemporaneous papers introducing speculative decoding. Provides the theoretical framework and proves that the acceptance-rejection scheme (sketched after this list) produces outputs with exactly the target model's distribution. https://arxiv.org/abs/2211.17192

  • Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv preprint. The other foundational speculative decoding paper, independently developed at DeepMind. Provides practical implementation guidance and empirical results. https://arxiv.org/abs/2302.01318
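
The acceptance rule shared by both papers is compact enough to sketch: a drafted token x is kept with probability min(1, p_target(x) / p_draft(x)); on rejection, a replacement is drawn from the normalized residual max(0, p_target - p_draft), which is what guarantees the output matches the target distribution. Below is a minimal NumPy sketch; the function name is illustrative and the inputs are assumed to be vocabulary-sized probability vectors for a single position.

    import numpy as np

    def accept_or_resample(x, p_target, p_draft, rng=np.random.default_rng()):
        """x: drafted token id; p_target, p_draft: probability vectors over the vocab."""
        if rng.random() < min(1.0, p_target[x] / p_draft[x]):
            return x, True                      # accepted: keep the drafted token
        residual = np.maximum(p_target - p_draft, 0.0)
        residual /= residual.sum()              # renormalize the leftover probability mass
        return rng.choice(len(residual), p=residual), False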

Knowledge Distillation

  • Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." NIPS 2014 Deep Learning Workshop. The foundational paper on knowledge distillation. Introduces the concept of "dark knowledge" and the temperature-scaled softmax for extracting richer supervision from teacher models; a minimal version of this loss is sketched after this list. https://arxiv.org/abs/1503.02531

  • Hsieh, C.-Y., Li, C.-L., Yeh, C.-K., et al. (2023). "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes." ACL 2023. Shows that extracting rationales (chain-of-thought explanations) from the teacher and using them as additional supervision can produce students that outperform the teacher on specific tasks.

  • Gu, Y., Dong, L., Wei, F., and Huang, M. (2024). "MiniLLM: Knowledge Distillation of Large Language Models." ICLR 2024. Addresses the challenge of sequence-level distillation for auto-regressive models, proposing a reverse KL divergence objective that prevents the student from overestimating low-probability sequences.
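
Hinton et al.'s temperature-scaled loss, referenced above, is only a few lines in PyTorch. The sketch below assumes student and teacher logits of matching shape; in practice it is usually mixed with a standard cross-entropy term on the hard labels.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # T softens both distributions; the T**2 factor keeps gradients comparable
        # in magnitude to the hard-label loss as T changes.
        soft_targets = F.softmax(teacher_logits / T, dim=-1)
        log_probs = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T ** 2)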

Pruning

  • Frantar, E. and Alistarh, D. (2023). "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." ICML 2023. Extends the GPTQ framework to pruning, achieving 50-60% unstructured sparsity with minimal accuracy loss on models up to 175B parameters. https://arxiv.org/abs/2301.00774

  • Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. (2024). "A Simple and Effective Pruning Approach for Large Language Models." ICLR 2024. Introduces Wanda (pruning by weights and activations), which scores each weight by its magnitude times the norm of its input activations, achieving competitive results without weight reconstruction. https://arxiv.org/abs/2306.11695
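
To illustrate how simple the Wanda criterion is, the sketch below scores each weight by |weight| times the L2 norm of its input activations and zeros the lowest-scoring fraction within each output row. This is a minimal unstructured-pruning sketch, not the paper's full calibration pipeline; the function name and arguments are placeholders.

    import torch

    def wanda_prune_(weight, activations, sparsity=0.5):
        """weight: [out, in]; activations: [n_tokens, in]; prunes in place."""
        score = weight.abs() * activations.norm(p=2, dim=0)      # broadcast over rows
        k = int(weight.shape[1] * sparsity)                       # weights to drop per row
        idx = torch.topk(score, k, dim=1, largest=False).indices  # lowest-scoring columns
        weight.scatter_(1, idx, 0.0)                               # zero them out
        return weight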

Serving Frameworks Documentation

  • vLLM Documentation: https://docs.vllm.ai/ --- Comprehensive documentation for vLLM including model support, configuration options, distributed inference, and production deployment guides; a minimal offline-generation example appears after this list.

  • NVIDIA TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/ --- Documentation for TensorRT-LLM including model compilation, quantization, parallelism, and benchmarking tools.

  • Hugging Face Text Generation Inference (TGI): https://huggingface.co/docs/text-generation-inference/ --- Documentation for TGI including model deployment, configuration, and integration with the HuggingFace ecosystem.

  • SGLang: https://sgl-project.github.io/ --- Documentation for SGLang including RadixAttention for efficient prefix caching and structured generation support.

  • llama.cpp: https://github.com/ggerganov/llama.cpp --- The primary open-source project for efficient CPU and consumer GPU inference, including the GGUF model format and quantization methods.

  • Ollama: https://ollama.ai/ --- User-friendly wrapper around llama.cpp for easy local model deployment and management.
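
As a sense of how little code the offline path of vLLM requires (see the vLLM entry above), here is a minimal generation sketch. The model name is a placeholder, and it assumes a GPU with enough memory for the weights.

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")   # loads weights, allocates the paged KV cache
    params = SamplingParams(temperature=0.8, max_tokens=128)

    outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
    print(outputs[0].outputs[0].text)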

Online Resources and Tutorials

  • Lilian Weng's "Large Transformer Model Inference Optimization" (2023): https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ --- Excellent overview of inference optimization techniques including quantization, pruning, distillation, and serving optimizations.

  • Hugging Face Optimum Documentation: https://huggingface.co/docs/optimum/ --- Library for hardware-accelerated inference including ONNX Runtime, Intel OpenVINO, and quantization tools.

  • NVIDIA Deep Learning Performance Guide: https://docs.nvidia.com/deeplearning/performance/ --- Comprehensive guide to optimizing deep learning workloads on NVIDIA GPUs, covering mixed precision, tensor cores, and memory optimization.

  • The Full Stack's LLM Inference Performance Engineering Guide: A series of blog posts covering memory bandwidth analysis, batching strategies, and serving optimization for LLMs.
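
The memory-bandwidth analysis these resources walk through often starts from a single napkin calculation: at batch size 1, every decoded token must stream all model weights from HBM at least once, so bandwidth divided by model size bounds tokens per second. The numbers below are illustrative assumptions, not measurements.

    # Napkin calculation of the bandwidth ceiling on single-stream decoding.
    weight_bytes = 7e9 * 2      # assume a 7B-parameter model held in FP16
    hbm_bandwidth = 2.0e12      # assume ~2 TB/s, roughly A100-class HBM
    # Each decoded token reads every weight once, so this bounds batch-1 throughput.
    print(f"~{hbm_bandwidth / weight_bytes:.0f} tokens/s upper bound")  # ~143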

Software Libraries

  • vLLM (vllm): High-throughput LLM serving with PagedAttention. pip install vllm.

  • bitsandbytes (bitsandbytes): 8-bit (LLM.int8()) and 4-bit (NF4/FP4) quantization for PyTorch. pip install bitsandbytes.

  • AutoGPTQ (auto-gptq): GPTQ quantization implementation. pip install auto-gptq.

  • AutoAWQ (autoawq): AWQ quantization implementation. pip install autoawq.

  • GGUF/llama-cpp-python (llama-cpp-python): Python bindings for llama.cpp; a short usage sketch appears after this list. pip install llama-cpp-python.

  • Optimum (optimum): Hugging Face library for hardware-accelerated inference. pip install optimum.

  • Flash Attention (flash-attn): Optimized attention implementation. pip install flash-attn.

  • DeepSpeed-Inference (deepspeed): Microsoft's inference optimization library with tensor parallelism and kernel fusion. pip install deepspeed.
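
As referenced in the llama-cpp-python entry above, the bindings can be exercised in a few lines. The sketch below assumes a locally downloaded, quantized GGUF file; the path is a placeholder.

    from llama_cpp import Llama

    llm = Llama(model_path="./model-q4_k_m.gguf", n_ctx=4096)  # quantized GGUF weights
    out = llm("Q: What is speculative decoding? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])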

Advanced Topics for Further Study

  • Mixture of Experts (MoE) Inference: MoE models (Mixtral, Switch Transformer) activate only a subset of parameters per token, so per-token compute is far lower than in a dense model of the same total parameter count. Understanding expert routing and load balancing is key to efficient MoE serving.

  • Dynamic Quantization and Mixed Precision: Adapting quantization precision per layer, or computing activation scales per token at runtime. SmoothQuant, which migrates quantization difficulty from activations to weights via per-channel scaling so that both can be quantized to INT8, and other mixed-precision methods offer better quality-speed tradeoffs than uniform weight-only quantization.

  • Inference on Edge Devices: Deploying models on mobile phones, IoT devices, and embedded systems using frameworks like TensorFlow Lite, Core ML, and ONNX Runtime Mobile.

  • Hardware-Specific Optimization: Understanding the memory hierarchy, compute capabilities, and optimization opportunities of specific hardware (NVIDIA H100, AMD MI300X, Google TPU v5, Apple M-series, Intel Gaudi).