Chapter 33: Key Takeaways
Core Concepts
- Inference cost dominates training cost for deployed models. Training happens once; inference happens millions of times. Optimizing inference directly translates to reduced operational costs and improved user experience.
- The decode phase is memory-bandwidth-bound, not compute-bound. Each auto-regressive token generation step requires reading the entire model weights from GPU memory. Reducing the bytes per weight (quantization) has more impact on decode speed than adding more compute; a back-of-the-envelope estimate follows this list.
- Prefill and decode have fundamentally different bottlenecks. Prefill processes many tokens in parallel (compute-bound), while decode generates one token at a time (memory-bandwidth-bound). Effective optimization strategies address each phase differently.
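To make the memory-bandwidth bound concrete, here is a rough estimate of decode throughput; the parameter count, precision, and bandwidth figures are illustrative assumptions, and the calculation ignores KV cache reads and activations.

```python
# Rough upper bound on decode speed: every generated token must stream all
# model weights from GPU memory, so bandwidth caps tokens/sec at batch size 1.
# All numbers are illustrative assumptions, not measurements.
params = 7e9              # assumed 7B-parameter model
bytes_per_weight = 2      # FP16
hbm_bandwidth = 2.0e12    # assumed ~2 TB/s of GPU memory bandwidth

weight_bytes = params * bytes_per_weight            # ~14 GB read per step
max_tokens_per_sec = hbm_bandwidth / weight_bytes   # ignores KV cache reads

print(f"~{max_tokens_per_sec:.0f} tokens/s upper bound at batch size 1")
# Halving the bytes per weight (INT8) roughly doubles this bound; adding more
# FLOPs does not move it, which is why decode is memory-bandwidth-bound.
```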
Quantization
- INT8 quantization is nearly free in terms of quality. Post-training INT8 quantization typically incurs less than 0.5% accuracy loss while halving model memory and providing a 1.5-2x speedup. It should be the default for any production deployment; a minimal sketch of the per-channel scheme follows this list.
- INT4 quantization enables larger models on fewer GPUs. The primary value of INT4 is not speed but size: fitting a 70B model on a single GPU instead of four. The quality tradeoff (1-3% accuracy loss) is acceptable for most applications.
- AWQ and GPTQ are the leading INT4 methods. AWQ protects activation-salient channels, is faster to apply, and generalizes somewhat better; GPTQ uses Hessian-based error compensation and can achieve slightly higher accuracy on in-distribution data. Both produce comparable results in practice.
- Not all weights are equally important. Both AWQ and GPTQ exploit the insight that a small fraction of weight channels are critical for model quality. Protecting these channels during quantization preserves disproportionate quality.
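Below is a minimal sketch of what post-training INT8 weight quantization does mechanically, assuming PyTorch and symmetric per-output-channel scales; real toolchains add calibration data, activation handling, and fused INT8 kernels that this omits. (The INT4 sizing argument is plain arithmetic: 70B weights at 4 bits is roughly 35 GB, versus about 140 GB at FP16.)

```python
import torch

def quantize_int8_per_channel(w: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a weight matrix.
    w: [out_features, in_features] FP16/FP32 weights.
    Returns int8 weights plus per-channel scales for dequantization."""
    # One scale per output channel, chosen so the channel's max |w| maps to 127.
    max_abs = w.abs().amax(dim=1, keepdim=True)          # [out_features, 1]
    scale = (max_abs / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Usage: the round-trip error is typically small relative to the weights.
w = torch.randn(4096, 4096)
q, s = quantize_int8_per_channel(w)
err = (dequantize(q, s) - w).abs().mean() / w.abs().mean()
print(f"mean relative error: {err:.4f}")   # typically around 1% or less
```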
Distillation and Pruning
- Knowledge distillation trades model size for training compute. A 70B teacher distilled to a 7B student can retain 85-95% of the teacher's quality at roughly one-tenth the inference cost. The temperature parameter reveals "dark knowledge" in the teacher's soft probability distributions; the loss is sketched after this list.
- 2:4 structured sparsity is the most practical pruning approach. It is the only sparsity pattern with native hardware support on NVIDIA GPUs, providing a 2x speedup with minimal accuracy loss (the pattern itself is sketched below). Unstructured sparsity achieves higher compression ratios but lacks hardware acceleration.
- Layer pruning is surprisingly effective. Removing 20-30% of a deep transformer's layers can retain 90%+ of performance, suggesting significant redundancy in very deep models.
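A minimal sketch of the temperature-scaled distillation objective, assuming PyTorch; the temperature and mixing weight alpha shown here are illustrative defaults, not values prescribed by the chapter.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend of soft-target KL (teacher 'dark knowledge') and hard-label CE."""
    # Softening both distributions with T > 1 exposes the relative
    # probabilities among the teacher's non-argmax classes.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```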
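And a sketch of what the 2:4 pattern means mechanically: in every contiguous group of four weights, keep the two largest by magnitude and zero the rest. Magnitude selection is one common heuristic; production flows typically fine-tune afterward and rely on sparse tensor cores for the actual speedup.

```python
import torch

def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    """Zero 2 of every 4 consecutive weights along the last dim, keeping the
    2 largest by magnitude. Requires the last dim to be divisible by 4."""
    out_shape = w.shape
    groups = w.reshape(-1, 4)
    # Indices of the 2 smallest-magnitude weights in each group of 4.
    _, drop_idx = groups.abs().topk(2, dim=1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop_idx, 0.0)
    return (groups * mask).reshape(out_shape)

w = torch.randn(8, 16)
sparse_w = prune_2_4(w)
# Every group of 4 now has at most 2 nonzero entries.
assert ((sparse_w.reshape(-1, 4) != 0).sum(dim=1) <= 2).all()
```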
KV Cache Optimization
- The KV cache is often the memory bottleneck, not the model weights. For long-context or high-batch-size deployments, KV cache memory can exceed model weight memory; the worked sizing example after this list makes this concrete. Optimization is essential for scaling.
- Grouped-Query Attention (GQA) provides the best quality-efficiency tradeoff for KV cache reduction. GQA shares K/V across groups of attention heads, reducing the KV cache proportionally while preserving quality far better than Multi-Query Attention, which shares a single K/V head across all query heads.
- KV cache quantization (INT8) halves cache memory with negligible quality impact. Combined with GQA, this can reduce the KV cache to 12.5% of the baseline MHA FP16 configuration.
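A worked sizing example for these points; the 8B-class model shape, context length, and batch size below are assumptions chosen for illustration.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # K and V each store kv_heads * head_dim values per layer, per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * seq_len * batch / 1e9

# Assumed 8B-class shape: 32 layers, head_dim 128, 32 query heads.
layers, head_dim, seq_len, batch = 32, 128, 4096, 32

mha_fp16 = kv_cache_gb(layers, 32, head_dim, seq_len, batch, 2)  # ~69 GB
gqa_fp16 = kv_cache_gb(layers, 8,  head_dim, seq_len, batch, 2)  # ~17 GB (25%)
gqa_int8 = kv_cache_gb(layers, 8,  head_dim, seq_len, batch, 1)  # ~9 GB (12.5%)

# At this batch and context length, the MHA FP16 cache (~69 GB) already
# exceeds the ~16 GB of FP16 weights for an 8B model.
print(f"{mha_fp16:.1f} GB -> {gqa_fp16:.1f} GB -> {gqa_int8:.1f} GB")
```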
Serving and Batching
- PagedAttention eliminates 60-80% of memory waste from KV cache fragmentation. Borrowing virtual memory concepts from operating systems, PagedAttention enables near-zero memory waste and is the single most impactful serving optimization; a toy allocator is sketched after this list.
- Continuous batching is essential for production serving. Dynamic batch management eliminates head-of-line blocking and maximizes GPU utilization, providing 3-10x throughput improvement over static batching.
- Speculative decoding generates multiple tokens per target model pass with zero quality loss. The acceptance-rejection scheme (also sketched below) guarantees statistical equivalence with the target model's distribution while achieving 2-3x speedup for auto-regressive generation.
- Chunked prefill prevents long prompts from blocking decode operations. Splitting long prompt prefills into chunks interleaved with decode steps maintains consistent generation latency for in-flight requests.
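A toy sketch of the PagedAttention idea (not vLLM's actual implementation): the KV cache is carved into fixed-size blocks, each sequence keeps a block table instead of a contiguous maximum-length slab, and internal waste is bounded by one partially filled block per sequence.

```python
class ToyPagedKVAllocator:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids
        self.lengths: dict[int, int] = {}             # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve KV slots for one new token; returns (block_id, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # Current block is full (or sequence is new): grab any free block.
            # Blocks need not be contiguous, so fragmentation is avoided.
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```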
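And a sketch of the per-token acceptance test at the core of speculative decoding, assuming the draft and target probability vectors for the position are already available; the residual resampling on rejection is what makes the output distribution provably identical to the target model's.

```python
import torch

def accept_or_resample(draft_token: int,
                       p_target: torch.Tensor,
                       q_draft: torch.Tensor) -> tuple[int, bool]:
    """One step of the speculative-sampling accept/reject rule.
    p_target, q_draft: probability vectors over the vocabulary at this position.
    Returns (token, accepted)."""
    # Accept the draft token with probability min(1, p/q).
    accept_prob = torch.clamp(p_target[draft_token] / q_draft[draft_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return draft_token, True
    # On rejection, resample from the normalized residual max(p - q, 0), which
    # corrects the distribution so outputs exactly match the target model.
    residual = torch.clamp(p_target - q_draft, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1)), False
```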
Production Engineering
- TTFT and TPS are the user-facing metrics that matter. Time to first token determines perceived responsiveness; tokens per second determines reading comfort for streaming responses. Monitor P95 and P99, not just P50.
- Prefix caching is high-ROI when requests share system prompts. When many requests begin with the same long system prompt, caching that prefix eliminates redundant computation proportional to its length.
- Model cascades reduce average cost by routing easy requests to smaller models. A small model handles high-confidence requests cheaply; a large model handles difficult requests accurately. This can reduce average cost by 50-80%; a minimal routing sketch follows this list.
- The optimizations in this chapter are composable. Combining INT4 quantization, PagedAttention, continuous batching, speculative decoding, and prefix caching can yield a 10-24x cost reduction compared to naive serving. Each technique addresses a different bottleneck.
- Self-hosting breaks even at approximately 50M tokens per day. Below this volume, API providers are more cost-effective due to lower fixed costs. Above this volume, the marginal cost advantage of self-hosting dominates; the break-even arithmetic is sketched after this list.
- Flash Attention is foundational for all modern inference. By tiling the attention computation to fit in GPU SRAM and fusing the entire operation into a single kernel, Flash Attention provides 2-4x speedup with linear memory scaling.
- Measure before optimizing. Profile the inference pipeline to identify the actual bottleneck before applying optimizations: quantization does little for compute-bound prefill, and extra compute does little for memory-bandwidth-bound decode. Always optimize the binding constraint.
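A minimal sketch of confidence-based cascade routing; small_model, large_model, and the threshold are hypothetical placeholders, and real systems often use a trained router or task-specific signals instead of raw model confidence.

```python
from typing import Callable

def cascade(prompt: str,
            small_model: Callable[[str], tuple[str, float]],
            large_model: Callable[[str], str],
            confidence_threshold: float = 0.9) -> str:
    """Route to the cheap model first; escalate only when it is unsure.
    small_model returns (answer, confidence in [0, 1])."""
    answer, confidence = small_model(prompt)
    if confidence >= confidence_threshold:
        return answer               # cheap path: most easy requests stop here
    return large_model(prompt)      # expensive path: only the hard tail
```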
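And the kind of back-of-the-envelope break-even arithmetic behind the ~50M tokens/day figure; every price below is an assumed placeholder, so substitute real API quotes, GPU rates, and utilization before drawing conclusions.

```python
# Illustrative break-even: self-hosting has a fixed daily cost, the API a
# per-token cost. Every number here is an assumed placeholder.
api_price_per_1m_tokens = 4.00   # assumed blended $/1M tokens from an API
gpu_hourly_cost = 4.00           # assumed $/GPU-hour including overhead
num_gpus = 2

self_host_per_day = gpu_hourly_cost * num_gpus * 24             # $192/day, fixed
break_even_tokens_per_day = self_host_per_day / api_price_per_1m_tokens * 1e6

print(f"Break-even at ~{break_even_tokens_per_day / 1e6:.0f}M tokens/day")  # ~48M here
# Below this volume, pay-per-token API pricing wins; above it, the fixed
# self-hosting cost is amortized over more tokens (capacity permitting).
```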