Chapter 33: Further Reading

Foundational Papers

Quantization

  • Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers." ICLR 2023. The paper that made INT4 quantization practical for large language models. Introduces a layer-wise quantization approach based on approximate second-order information that achieves near-lossless INT4 quantization. https://arxiv.org/abs/2210.17323

  • Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. (2023). "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024. Proposes protecting the small fraction of weight channels that matter most to activations by scaling them per channel before quantization, achieving comparable or better quality than GPTQ with significantly faster quantization. https://arxiv.org/abs/2306.00978

  • Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS 2023. Introduces 4-bit NormalFloat (NF4) quantization and demonstrates that fine-tuning a quantized model with LoRA adapters matches full 16-bit fine-tuning quality. Essential reading for combining quantization with fine-tuning; a minimal NF4 loading sketch appears after this list. https://arxiv.org/abs/2305.14314

  • Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. Identifies the "emergent outlier" phenomenon in transformer activations and proposes mixed-precision decomposition (INT8 for most features, FP16 for outlier features) for zero-degradation 8-bit inference. https://arxiv.org/abs/2208.07339
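
A concrete entry point to these methods is loading a model with QLoRA-style 4-bit NF4 weights through the transformers and bitsandbytes libraries. The sketch below is a minimal example, assuming both packages are installed and a GPU is available; the model name is a placeholder.

    # Minimal sketch: 4-bit NF4 weight loading (QLoRA-style) via bitsandbytes.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat4 from the QLoRA paper
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # place layers across available GPUs/CPU
    )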

Attention and Memory Optimization

  • Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. The foundational paper for vLLM. Applies OS-style virtual memory management to the KV cache, eliminating the 60-80% of cache memory that prior systems wasted to fragmentation. A must-read for understanding modern LLM serving. https://arxiv.org/abs/2309.06180

  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. Introduces IO-aware attention computation that tiles the attention operation to fit in GPU SRAM, achieving 2-4x speedup with linear memory scaling. Now standard in all production inference frameworks. https://arxiv.org/abs/2205.14135

  • Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." ICLR 2024. Improves upon FlashAttention with better work partitioning and parallelism, achieving 50-73% of theoretical peak FLOPs on A100 GPUs. https://arxiv.org/abs/2307.08691

  • Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023. Introduces Grouped-Query Attention as a middle ground between MHA and MQA, showing that GQA achieves quality close to MHA with KV cache size close to MQA. https://arxiv.org/abs/2305.13245
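
The KV cache savings that motivate GQA are easy to estimate from the model shape alone. The sketch below uses illustrative, roughly Llama-2-70B-like numbers (not figures from the paper) and counts two cached tensors, keys and values, per layer, per KV head, per position.

    # Back-of-the-envelope KV cache size, illustrating why GQA shrinks the cache.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
        # 2x for keys and values, stored per layer, per KV head, per position
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

    mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096, batch=1)
    gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8,  head_dim=128, seq_len=4096, batch=1)
    print(f"MHA: {mha / 2**30:.1f} GiB, GQA (8 KV heads): {gqa / 2**30:.1f} GiB")
    # MHA: 10.0 GiB, GQA (8 KV heads): 1.2 GiB  (FP16, single sequence)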

Speculative Decoding

  • Leviathan, Y., Kalman, M., and Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. One of two contemporaneous papers introducing speculative decoding. Provides the theoretical framework and proves that the acceptance-rejection scheme (sketched after this list) produces outputs with exactly the target model's distribution. https://arxiv.org/abs/2211.17192

  • Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv preprint. The other foundational speculative decoding paper, independently developed at DeepMind. Provides practical implementation guidance and empirical results. https://arxiv.org/abs/2302.01318
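
The acceptance rule shared by both papers is compact enough to sketch: a drafted token x is kept with probability min(1, p_target(x) / p_draft(x)); on rejection, a replacement is drawn from the normalized residual max(0, p_target - p_draft), which is what guarantees the output matches the target distribution. Below is a minimal NumPy sketch; the function name is illustrative and the inputs are assumed to be vocabulary-sized probability vectors for a single position.

    import numpy as np

    def accept_or_resample(x, p_target, p_draft, rng=np.random.default_rng()):
        """x: drafted token id; p_target, p_draft: probability vectors over the vocab."""
        if rng.random() < min(1.0, p_target[x] / p_draft[x]):
            return x, True                      # accepted: keep the drafted token
        residual = np.maximum(p_target - p_draft, 0.0)
        residual /= residual.sum()              # renormalize the leftover probability mass
        return rng.choice(len(residual), p=residual), False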

Knowledge Distillation

  • Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." NIPS 2014 Deep Learning Workshop. The foundational paper on knowledge distillation. Introduces the concept of "dark knowledge" and the temperature-scaled softmax for extracting richer supervision from teacher models; a minimal version of this loss is sketched after this list. https://arxiv.org/abs/1503.02531

  • Hsieh, C.-Y., Li, C.-L., Yeh, C.-K., et al. (2023). "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes." ACL 2023. Shows that extracting rationales (chain-of-thought explanations) from the teacher and using them as additional supervision can produce students that outperform the teacher on specific tasks.

  • Gu, Y., Dong, L., Wei, F., and Huang, M. (2024). "MiniLLM: Knowledge Distillation of Large Language Models." ICLR 2024. Addresses the challenge of sequence-level distillation for auto-regressive models, proposing a reverse KL divergence objective that prevents the student from overestimating low-probability sequences.
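
Hinton et al.'s temperature-scaled loss, referenced above, is only a few lines in PyTorch. The sketch below assumes student and teacher logits of matching shape; in practice it is usually mixed with a standard cross-entropy term on the hard labels.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # T softens both distributions; the T**2 factor keeps gradients comparable
        # in magnitude to the hard-label loss as T changes.
        soft_targets = F.softmax(teacher_logits / T, dim=-1)
        log_probs = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T ** 2)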

Pruning

  • Frantar, E. and Alistarh, D. (2023). "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." ICML 2023. Extends the GPTQ framework to pruning, achieving 50-60% unstructured sparsity with minimal accuracy loss on models up to 175B parameters. https://arxiv.org/abs/2301.00774

  • Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. (2024). "A Simple and Effective Pruning Approach for Large Language Models." ICLR 2024. Introduces Wanda (pruning by weights and activations), which scores each weight by its magnitude times the norm of its input activations, achieving competitive results without weight reconstruction. https://arxiv.org/abs/2306.11695
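
To illustrate how simple the Wanda criterion is, the sketch below scores each weight by |weight| times the L2 norm of its input activations and zeros the lowest-scoring fraction within each output row. This is a minimal unstructured-pruning sketch, not the paper's full calibration pipeline; the function name and arguments are placeholders.

    import torch

    def wanda_prune_(weight, activations, sparsity=0.5):
        """weight: [out, in]; activations: [n_tokens, in]; prunes in place."""
        score = weight.abs() * activations.norm(p=2, dim=0)      # broadcast over rows
        k = int(weight.shape[1] * sparsity)                       # weights to drop per row
        idx = torch.topk(score, k, dim=1, largest=False).indices  # lowest-scoring columns
        weight.scatter_(1, idx, 0.0)                               # zero them out
        return weight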

Serving Frameworks Documentation

  • vLLM Documentation: https://docs.vllm.ai/ --- Comprehensive documentation for vLLM including model support, configuration options, distributed inference, and production deployment guides; a minimal offline-generation example appears after this list.

  • NVIDIA TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/ --- Documentation for TensorRT-LLM including model compilation, quantization, parallelism, and benchmarking tools.

  • Hugging Face Text Generation Inference (TGI): https://huggingface.co/docs/text-generation-inference/ --- Documentation for TGI including model deployment, configuration, and integration with the HuggingFace ecosystem.

  • SGLang: https://sgl-project.github.io/ --- Documentation for SGLang including RadixAttention for efficient prefix caching and structured generation support.

  • llama.cpp: https://github.com/ggerganov/llama.cpp --- The primary open-source project for efficient CPU and consumer GPU inference, including the GGUF model format and quantization methods.

  • Ollama: https://ollama.ai/ --- User-friendly wrapper around llama.cpp for easy local model deployment and management.
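
As a sense of how little code the offline path of vLLM requires (see the vLLM entry above), here is a minimal generation sketch. The model name is a placeholder, and it assumes a GPU with enough memory for the weights.

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")   # loads weights, allocates the paged KV cache
    params = SamplingParams(temperature=0.8, max_tokens=128)

    outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
    print(outputs[0].outputs[0].text)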

Online Resources and Tutorials

  • Lilian Weng's "Large Transformer Model Inference Optimization" (2023): https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ --- Excellent overview of inference optimization techniques including quantization, pruning, distillation, and serving optimizations.

  • Hugging Face Optimum Documentation: https://huggingface.co/docs/optimum/ --- Library for hardware-accelerated inference including ONNX Runtime, Intel OpenVINO, and quantization tools.

  • NVIDIA Deep Learning Performance Guide: https://docs.nvidia.com/deeplearning/performance/ --- Comprehensive guide to optimizing deep learning workloads on NVIDIA GPUs, covering mixed precision, tensor cores, and memory optimization.

  • The Full Stack's LLM Inference Performance Engineering Guide: A series of blog posts covering memory bandwidth analysis, batching strategies, and serving optimization for LLMs.
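
The memory-bandwidth analysis these resources walk through often starts from a single napkin calculation: at batch size 1, every decoded token must stream all model weights from HBM at least once, so bandwidth divided by model size bounds tokens per second. The numbers below are illustrative assumptions, not measurements.

    # Napkin calculation of the bandwidth ceiling on single-stream decoding.
    weight_bytes = 7e9 * 2      # assume a 7B-parameter model held in FP16
    hbm_bandwidth = 2.0e12      # assume ~2 TB/s, roughly A100-class HBM
    # Each decoded token reads every weight once, so this bounds batch-1 throughput.
    print(f"~{hbm_bandwidth / weight_bytes:.0f} tokens/s upper bound")  # ~143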

Software Libraries

  • vLLM (vllm): High-throughput LLM serving with PagedAttention. pip install vllm.

  • bitsandbytes (bitsandbytes): 8-bit (LLM.int8()) and 4-bit (NF4/FP4) quantization for PyTorch. pip install bitsandbytes.

  • AutoGPTQ (auto-gptq): GPTQ quantization implementation. pip install auto-gptq.

  • AutoAWQ (autoawq): AWQ quantization implementation. pip install autoawq.

  • GGUF/llama-cpp-python (llama-cpp-python): Python bindings for llama.cpp; a short usage sketch appears after this list. pip install llama-cpp-python.

  • Optimum (optimum): Hugging Face library for hardware-accelerated inference. pip install optimum.

  • Flash Attention (flash-attn): Optimized attention implementation. pip install flash-attn.

  • DeepSpeed-Inference (deepspeed): Microsoft's inference optimization library with tensor parallelism and kernel fusion. pip install deepspeed.
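
As referenced in the llama-cpp-python entry above, the bindings can be exercised in a few lines. The sketch below assumes a locally downloaded, quantized GGUF file; the path is a placeholder.

    from llama_cpp import Llama

    llm = Llama(model_path="./model-q4_k_m.gguf", n_ctx=4096)  # quantized GGUF weights
    out = llm("Q: What is speculative decoding? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])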

Advanced Topics for Further Study

  • Mixture of Experts (MoE) Inference: MoE models (Mixtral, Switch Transformer) activate only a subset of parameters per token, so per-token compute is far lower than in a dense model of the same total parameter count. Understanding expert routing and load balancing is key to efficient MoE serving.

  • Dynamic Quantization and Mixed Precision: Adapting quantization precision per layer, or computing activation scales per token at runtime. SmoothQuant, which migrates quantization difficulty from activations to weights via per-channel scaling so that both can be quantized to INT8, and other mixed-precision methods offer better quality-speed tradeoffs than uniform weight-only quantization.

  • Inference on Edge Devices: Deploying models on mobile phones, IoT devices, and embedded systems using frameworks like TensorFlow Lite, Core ML, and ONNX Runtime Mobile.

  • Hardware-Specific Optimization: Understanding the memory hierarchy, compute capabilities, and optimization opportunities of specific hardware (NVIDIA H100, AMD MI300X, Google TPU v5, Apple M-series, Intel Gaudi).