Chapter 33: Further Reading
Foundational Papers
Quantization
- Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers." ICLR 2023. The paper that made INT4 quantization practical for large language models. Introduces a layer-wise quantization approach based on approximate second-order information that achieves near-lossless INT4 quantization. https://arxiv.org/abs/2210.17323
- Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. (2023). "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024. Proposes protecting activation-salient channels during quantization via per-channel scaling, achieving comparable or better quality than GPTQ with significantly faster quantization. https://arxiv.org/abs/2306.00978
- Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS 2023. Introduces 4-bit NormalFloat (NF4) quantization and demonstrates that fine-tuning a quantized model with LoRA adapters matches full 16-bit fine-tuning quality. Essential reading for combining quantization with fine-tuning; a loading sketch follows this list. https://arxiv.org/abs/2305.14314
- Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. Identifies the "emergent outlier" phenomenon in transformer activations and proposes mixed-precision decomposition (INT8 for most features, FP16 for outlier features) for zero-degradation 8-bit inference. https://arxiv.org/abs/2208.07339
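To make the NF4 idea from the QLoRA entry concrete, here is a minimal sketch that loads a causal language model in 4-bit via bitsandbytes through Hugging Face transformers. The checkpoint name is a placeholder, and default arguments may differ slightly across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 configuration in the spirit of QLoRA: 4-bit NormalFloat storage,
# bfloat16 compute, and double quantization of the quantization constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on available devices automatically
)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

The same quantized model can then be fine-tuned by attaching LoRA adapters, which is the combination the QLoRA paper studies.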
Attention and Memory Optimization
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. The foundational paper for vLLM. Applies OS-style virtual memory management to the KV cache, eliminating the 60-80% of cache memory that prior systems wasted through fragmentation. A must-read for understanding modern LLM serving. https://arxiv.org/abs/2309.06180
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. Introduces IO-aware attention computation that tiles the attention operation to fit in GPU SRAM, achieving 2-4x speedup with linear memory scaling. Now standard in all production inference frameworks. https://arxiv.org/abs/2205.14135
- Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." ICLR 2024. Improves upon FlashAttention with better work partitioning and parallelism, achieving 50-73% of theoretical peak FLOPs on A100 GPUs. https://arxiv.org/abs/2307.08691
- Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023. Introduces Grouped-Query Attention as a middle ground between MHA and MQA, showing that GQA achieves quality close to MHA with a KV cache size close to MQA; see the sketch after this list. https://arxiv.org/abs/2305.13245
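To make the MHA/MQA/GQA tradeoff concrete, the following PyTorch sketch shows grouped-query attention: only n_kv_heads key/value projections are stored (and cached), and each group of query heads shares one of them. Shapes and names are illustrative, not taken from any specific paper or library.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    # Each KV head serves `group_size` query heads, so the KV cache shrinks by
    # a factor of n_q_heads / n_kv_heads relative to full multi-head attention.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# Toy shapes: 32 query heads sharing 8 KV heads.
b, seq, d = 2, 16, 64
q = torch.randn(b, 32, seq, d)
k = torch.randn(b, 8, seq, d)
v = torch.randn(b, 8, seq, d)
print(grouped_query_attention(q, k, v, 32, 8).shape)  # torch.Size([2, 32, 16, 64])
```

Setting n_kv_heads equal to n_q_heads recovers standard MHA, and setting it to 1 recovers MQA, which is exactly the spectrum the GQA paper explores.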
Speculative Decoding
- Leviathan, Y., Kalman, M., and Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. One of two contemporaneous papers introducing speculative decoding. Provides the theoretical framework and proves that the acceptance-rejection scheme (sketched after this list) produces outputs with exactly the target model's distribution. https://arxiv.org/abs/2211.17192
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv preprint arXiv:2302.01318. The other foundational speculative decoding paper, developed independently at DeepMind. Provides practical implementation guidance and empirical results.
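The acceptance rule both papers analyze fits in a few lines. This sketch shows the accept/resample step for a single drafted token, with p the target model's distribution and q the draft model's distribution over the vocabulary; it is a simplified illustration, not a full decoding loop.

```python
import torch

def accept_or_resample(p, q, drafted_token):
    """One step of speculative sampling.

    p, q: target and draft distributions over the vocabulary (1-D tensors summing to 1).
    Accepting with probability min(1, p/q) and otherwise resampling from
    normalize(max(p - q, 0)) yields samples distributed exactly according to p.
    Returns (token, accepted).
    """
    accept_prob = torch.clamp(p[drafted_token] / q[drafted_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return drafted_token, True
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return torch.multinomial(residual, 1).item(), False

# Toy example with a 5-token vocabulary.
p = torch.tensor([0.4, 0.3, 0.1, 0.1, 0.1])
q = torch.tensor([0.2, 0.5, 0.1, 0.1, 0.1])
print(accept_or_resample(p, q, drafted_token=1))
```

In practice the draft model proposes several tokens at once and the target model scores them in a single forward pass, which is where the speedup comes from.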
Knowledge Distillation
- Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." NIPS 2014 Deep Learning Workshop. The foundational paper on knowledge distillation. Introduces the concept of "dark knowledge" and the temperature-scaled softmax for extracting richer supervision from teacher models (see the loss sketch after this list). https://arxiv.org/abs/1503.02531
- Hsieh, C.-Y., Li, C.-L., Yeh, C.-K., et al. (2023). "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes." ACL 2023. Shows that extracting rationales (chain-of-thought explanations) from the teacher and using them as additional supervision can produce students that outperform the teacher on specific tasks.
- Gu, Y., Dong, L., Wei, F., and Huang, M. (2024). "MiniLLM: Knowledge Distillation of Large Language Models." ICLR 2024. Addresses the challenge of sequence-level distillation for autoregressive models, proposing a reverse KL divergence objective that prevents the student from overestimating the low-probability regions of the teacher's distribution.
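The temperature-scaled softmax from Hinton et al. translates into a short loss function: soften teacher and student logits with a temperature T, compare them with KL divergence, and rescale by T^2 so the soft-target gradients keep the same magnitude as the hard-label loss. A minimal PyTorch sketch, with the mixing weight alpha as an illustrative hyperparameter:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL (teacher supervision) with hard-label cross-entropy."""
    # Temperature-softened distributions expose the teacher's "dark knowledge":
    # the relative probabilities it assigns to incorrect classes.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the scale of the hard loss
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: batch of 4 examples, 10 classes.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```

For autoregressive language models the same loss is applied per token position, which is where the sequence-level issues addressed by MiniLLM arise.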
Pruning
- Frantar, E. and Alistarh, D. (2023). "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." ICML 2023. Extends the GPTQ framework to pruning, achieving 50-60% unstructured sparsity with minimal accuracy loss on models up to 175B parameters. https://arxiv.org/abs/2301.00774
- Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. (2024). "A Simple and Effective Pruning Approach for Large Language Models." ICLR 2024. Introduces Wanda (Pruning by Weights and Activations), a criterion that scores each weight by its magnitude times the norm of the corresponding input activation, achieving competitive results without weight reconstruction; a sketch of the criterion follows this list.
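The Wanda criterion is simple enough to show directly: score each weight by its magnitude times the L2 norm of the matching input activation, then drop the lowest-scoring weights within each output row. A minimal sketch, with the 50% sparsity level and shapes chosen purely for illustration:

```python
import torch

def wanda_prune(weight, activations, sparsity=0.5):
    """weight: (out_features, in_features); activations: (n_samples, in_features)."""
    # Score = |W_ij| * ||X_j||_2, so small weights feeding high-magnitude inputs survive.
    input_norms = activations.norm(p=2, dim=0)   # (in_features,)
    scores = weight.abs() * input_norms          # broadcast across output rows
    k = int(weight.shape[1] * sparsity)
    # Prune the k lowest-scoring weights within each output row.
    prune_idx = scores.argsort(dim=1)[:, :k]
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask

W = torch.randn(8, 16)
X = torch.randn(128, 16)   # calibration activations
print((wanda_prune(W, X) == 0).float().mean())  # ~0.5
```

The activations come from a small calibration set; unlike SparseGPT, no weight reconstruction step follows the masking.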
Serving Frameworks Documentation
- vLLM Documentation: https://docs.vllm.ai/ --- Comprehensive documentation for vLLM, including model support, configuration options, distributed inference, and production deployment guides; a minimal usage sketch follows this list.
- NVIDIA TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/ --- Documentation for TensorRT-LLM, including model compilation, quantization, parallelism, and benchmarking tools.
- Hugging Face Text Generation Inference (TGI): https://huggingface.co/docs/text-generation-inference/ --- Documentation for TGI, including model deployment, configuration, and integration with the Hugging Face ecosystem.
- SGLang: https://sgl-project.github.io/ --- Documentation for SGLang, including RadixAttention for efficient prefix caching and structured generation support.
- llama.cpp: https://github.com/ggerganov/llama.cpp --- The primary open-source project for efficient CPU and consumer-GPU inference, including the GGUF model format and its quantization methods.
- Ollama: https://ollama.ai/ --- A user-friendly wrapper around llama.cpp for easy local model deployment and management.
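As a starting point for the frameworks above, vLLM's offline Python API covers the common case in a few lines. This follows the pattern shown in the vLLM documentation; the checkpoint name is a placeholder and argument names may vary slightly between versions.

```python
from vllm import LLM, SamplingParams

# PagedAttention-based engine; continuous batching happens inside generate().
llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder checkpoint
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Explain KV cache paging in one sentence:",
    "Why does batching improve GPU utilization?",
]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, the same engine is typically exposed through vLLM's OpenAI-compatible HTTP server rather than called directly.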
Online Resources and Tutorials
- Lilian Weng's "Large Transformer Model Inference Optimization" (2023): https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ --- An excellent overview of inference optimization techniques, including quantization, pruning, distillation, and serving optimizations.
- Hugging Face Optimum Documentation: https://huggingface.co/docs/optimum/ --- Library for hardware-accelerated inference, including ONNX Runtime, Intel OpenVINO, and quantization tools.
- NVIDIA Deep Learning Performance Guide: https://docs.nvidia.com/deeplearning/performance/ --- Comprehensive guide to optimizing deep learning workloads on NVIDIA GPUs, covering mixed precision, Tensor Cores, and memory optimization.
- The Full Stack's LLM Inference Performance Engineering Guide: a series of blog posts covering memory bandwidth analysis, batching strategies, and serving optimization for LLMs.
Software Libraries
- vLLM (`vllm`): High-throughput LLM serving with PagedAttention. `pip install vllm`.
- bitsandbytes (`bitsandbytes`): INT8 and INT4 quantization for PyTorch. `pip install bitsandbytes`.
- AutoGPTQ (`auto-gptq`): GPTQ quantization implementation. `pip install auto-gptq`.
- AutoAWQ (`autoawq`): AWQ quantization implementation. `pip install autoawq`.
- GGUF/llama-cpp-python (`llama-cpp-python`): Python bindings for llama.cpp; a usage sketch follows this list. `pip install llama-cpp-python`.
- Optimum (`optimum`): Hugging Face library for hardware-accelerated inference. `pip install optimum`.
- Flash Attention (`flash-attn`): Optimized attention implementation. `pip install flash-attn`.
- DeepSpeed-Inference (`deepspeed`): Microsoft's inference optimization library with tensor parallelism and kernel fusion. `pip install deepspeed`.
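For local CPU or consumer-GPU inference with the GGUF ecosystem, llama-cpp-python exposes a compact high-level interface. The model path below is a placeholder; quantized GGUF files are downloaded separately, and GPU offload requires a GPU-enabled build.

```python
from llama_cpp import Llama

# Load a quantized GGUF model; n_gpu_layers offloads layers to the GPU if available.
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window length
    n_gpu_layers=-1,   # -1 = offload all layers when a GPU build is installed
)

result = llm(
    "Q: What does INT4 quantization trade off? A:",
    max_tokens=64,
    temperature=0.7,
    stop=["Q:"],
)
print(result["choices"][0]["text"])
```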
Advanced Topics for Further Study
- Mixture of Experts (MoE) Inference: MoE models (Mixtral, Switch Transformer) activate only a subset of parameters per token, providing inherent inference efficiency. Understanding expert routing and load balancing is key to efficient MoE serving; a minimal routing sketch follows this list.
- Dynamic Quantization and Mixed Precision: Adapting quantization precision per layer or per token based on input difficulty. SmoothQuant, which migrates quantization difficulty from activations to weights via per-channel scaling, and other mixed-precision methods provide better quality-speed tradeoffs than uniform low-bit quantization.
- Inference on Edge Devices: Deploying models on mobile phones, IoT devices, and embedded systems using frameworks such as TensorFlow Lite, Core ML, and ONNX Runtime Mobile.
- Hardware-Specific Optimization: Understanding the memory hierarchy, compute capabilities, and optimization opportunities of specific hardware (NVIDIA H100, AMD MI300X, Google TPU v5, Apple M-series, Intel Gaudi).
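To illustrate the expert-routing step mentioned in the MoE item above, the following sketch implements top-2 routing over a set of expert MLPs. The layer sizes, number of experts, and class name are illustrative rather than taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Route each token to its top-k experts and mix their outputs."""

    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.gate(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token: that is the inference saving.
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

router = TopKRouter()
tokens = torch.randn(10, 64)
print(router(tokens).shape)  # torch.Size([10, 64])
```

Production MoE serving replaces the Python loops with grouped or batched expert kernels and adds load balancing so no single expert becomes a bottleneck.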