"Training a model is a one-time cost. Serving it is a forever cost."
In This Chapter
- 33.1 Introduction: The Inference Challenge
- 33.2 Inference Bottlenecks
- 33.3 Quantization
- 33.4 Knowledge Distillation
- 33.5 Pruning
- 33.6 KV Cache Optimization
- 33.7 PagedAttention and vLLM
- 33.8 Speculative Decoding
- 33.9 Batching Strategies
- 33.10 Compilation and Kernel Optimization
- 33.11 Serving Frameworks
- 33.12 Latency vs. Throughput Optimization
- 33.13 Cost Analysis
- 33.14 Production Deployment Patterns
- 33.15 Putting It All Together: An Optimization Workflow
- 33.16 Summary
- References
Chapter 33: Inference Optimization and Model Serving
Part VI: AI Systems Engineering
"Training a model is a one-time cost. Serving it is a forever cost."
33.1 Introduction: The Inference Challenge
Training a large language model is expensive—often requiring millions of dollars in compute over weeks or months. But training happens once. Inference—the process of generating predictions from a trained model—happens millions or billions of times. For many organizations, inference costs dwarf training costs within months of deployment.
Consider a practical example: serving a 70-billion-parameter model for a customer-facing application that handles 1,000 requests per second, each generating 200 tokens. Without optimization, this requires a fleet of expensive GPUs, introduces latency that degrades user experience, and generates infrastructure costs that threaten the product's economic viability.
This chapter covers the engineering techniques that make large model inference practical: quantization to reduce model size and memory bandwidth requirements, knowledge distillation to create smaller but capable models, pruning to eliminate unnecessary parameters, attention optimizations to accelerate the generation loop, serving frameworks that maximize throughput, and the systems-level considerations that determine whether a model deployment succeeds or fails in production.
Why Inference Optimization Matters
The economics of inference are stark:
$$\text{Annual Inference Cost} = \text{Requests/sec} \times \text{Cost/Request} \times 31{,}536{,}000 \text{ sec/year}$$
At 1,000 requests per second with a cost of $0.001 per request, annual inference costs reach $31.5 million. Reducing cost per request by even 50% through optimization saves over $15 million annually.
Beyond cost, inference optimization directly affects user experience:
| Latency | User Perception |
|---|---|
| < 100ms | Instantaneous |
| 100-300ms | Responsive |
| 300ms-1s | Noticeable delay |
| 1-3s | Perceived as slow |
| > 3s | User frustration, abandonment |
For token-by-token generation in conversational AI, the key metrics are time to first token (TTFT) and tokens per second (TPS). Users expect the first token within 500ms and a generation rate of at least 30 tokens per second for a smooth streaming experience.
The Inference Pipeline
A typical transformer inference pipeline consists of:
Input Tokens → Tokenization → Prefill (Prompt Processing) → Decode (Token Generation) → Detokenization → Output
- Tokenization: Convert input text to token IDs (fast, negligible latency).
- Prefill (prompt processing): Process all input tokens in parallel to generate the initial KV cache. This is compute-bound.
- Decode (auto-regressive generation): Generate tokens one at a time, each requiring a full forward pass. This is memory-bandwidth-bound.
- Detokenization: Convert output token IDs back to text (fast, negligible latency).
The prefill and decode phases have fundamentally different computational characteristics, and optimizing each requires different techniques.
33.2 Inference Bottlenecks
33.2.1 Memory Bandwidth: The True Bottleneck
For auto-regressive generation with batch size 1, the primary bottleneck is memory bandwidth, not compute. Each token generation step requires reading the entire model's weights from GPU memory. For a 70B parameter model in FP16:
$$\text{Model Size} = 70 \times 10^9 \times 2 \text{ bytes} = 140 \text{ GB}$$
If the GPU's memory bandwidth is 3.35 TB/s (NVIDIA H100), reading the model weights takes:
$$\text{Read Time} = \frac{140 \text{ GB}}{3{,}350 \text{ GB/s}} \approx 42 \text{ ms per token}$$
This gives a theoretical maximum of approximately 24 tokens per second for a single request—and that is before accounting for KV cache reads, activation computations, and other overhead.
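This bandwidth ceiling is easy to reproduce. The sketch below assumes the simplified model used above—one full read of the weights per generated token—and ignores KV cache traffic and other overhead, so it is an upper bound rather than a prediction:
def decode_ceiling_tokens_per_sec(n_params: float, bytes_per_param: float,
                                  bandwidth_gb_per_sec: float) -> float:
    """Upper bound on tokens/sec when every token requires one full weight read."""
    model_size_gb = n_params * bytes_per_param / 1e9
    read_time_sec = model_size_gb / bandwidth_gb_per_sec
    return 1.0 / read_time_sec

# 70B parameters in FP16 (2 bytes/param) on an H100 (~3,350 GB/s)
print(decode_ceiling_tokens_per_sec(70e9, 2.0, 3350))   # ~24 tokens/sec
# The same model quantized to INT4 (~0.5 bytes/param)
print(decode_ceiling_tokens_per_sec(70e9, 0.5, 3350))   # ~96 tokens/sec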
33.2.2 KV Cache Memory
During auto-regressive generation, the model stores key and value tensors from all previous tokens to avoid recomputation. The KV cache size grows linearly with sequence length:
$$\text{KV Cache} = 2 \times n_\text{layers} \times n_\text{heads} \times d_\text{head} \times \text{seq\_len} \times \text{batch\_size} \times \text{bytes\_per\_element}$$
For a 70B model (80 layers, 64 heads, 128 head dimension) with 4096 sequence length and FP16:
$$\text{KV Cache} = 2 \times 80 \times 64 \times 128 \times 4096 \times 2 \text{ bytes} \approx 10.7 \text{ GB per request}$$
With batch size 32, the KV cache alone requires 342 GB—more than the memory of most single GPUs. KV cache management is therefore a critical optimization target.
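The KV cache formula translates directly into code. A small helper whose parameter names mirror the symbols above (models using GQA substitute the number of KV heads for the number of query heads, as discussed in Section 33.6.2):
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache size following the formula above (the factor 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch_size * bytes_per_elem

# 70B-class model with full multi-head attention: 80 layers, 64 heads, d_head = 128
per_request = kv_cache_bytes(80, 64, 128, seq_len=4096, batch_size=1)
print(per_request / 1e9)        # ~10.7 GB per request
print(32 * per_request / 1e9)   # over 340 GB at batch size 32—more than any single GPU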
33.2.3 Compute vs. Memory Bound Analysis
The arithmetic intensity (also called operational intensity) measures the ratio of computation to data movement. It determines whether a workload is compute-bound or memory-bound:
$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}}$$
where:
- FLOPs is the number of floating-point operations performed.
- Bytes Transferred is the volume of data read from and written to GPU memory (HBM).
The GPU has a characteristic ridge point: the arithmetic intensity at which compute and memory bandwidth are balanced. For the H100, this is approximately $\text{989 TFLOPS} / \text{3.35 TB/s} \approx 295$ FLOPs/byte (for FP16). Operations below this intensity are memory-bound; operations above are compute-bound.
- Prefill (large batch of tokens): High arithmetic intensity → compute-bound. The matrix multiplications process many tokens at once, amortizing the cost of reading weights. Benefit from faster compute (tensor cores, larger batch sizes).
- Decode (single token, batch size 1): Low arithmetic intensity (~1-2 FLOPs/byte) → memory-bandwidth-bound. Each weight is read from HBM for just one multiply-add operation. Benefit from reduced model size (quantization) and faster memory.
This dichotomy explains why quantization (reducing bytes per weight) dramatically accelerates decode speed but has modest impact on prefill speed. It also explains why batching is so effective: increasing batch size from 1 to 32 increases arithmetic intensity by 32x, potentially moving the workload from memory-bound to compute-bound.
Practical implication: If you measure your serving system and find GPU compute utilization is below 30%, you are almost certainly memory-bandwidth-bound and should focus on quantization and batch size optimization. If compute utilization is above 70%, you are compute-bound and should focus on compilation optimizations and hardware upgrades.
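The ridge-point comparison can be expressed as a back-of-envelope check, using the H100 figures quoted above (989 TFLOPS FP16, 3.35 TB/s); the per-weight FLOP and byte counts are the simplified estimates from the decode discussion:
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved

def is_memory_bound(intensity: float, peak_tflops: float, bandwidth_tb_s: float) -> bool:
    """Compare against the GPU ridge point (peak FLOPs / peak bandwidth)."""
    ridge_point = (peak_tflops * 1e12) / (bandwidth_tb_s * 1e12)   # ~295 FLOPs/byte for H100 FP16
    return intensity < ridge_point

# Decode at batch size 1: each FP16 weight (2 bytes) is read for one multiply-add (2 FLOPs)
decode_intensity = arithmetic_intensity(flops=2.0, bytes_moved=2.0)         # 1 FLOP/byte
# The same weight amortized over a batch of 32 tokens
batched_intensity = arithmetic_intensity(flops=2.0 * 32, bytes_moved=2.0)   # 32 FLOPs/byte

print(is_memory_bound(decode_intensity, peak_tflops=989, bandwidth_tb_s=3.35))   # True
print(is_memory_bound(batched_intensity, peak_tflops=989, bandwidth_tb_s=3.35))  # True (still below the ridge)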
33.3 Quantization
Quantization reduces the numerical precision of model weights (and optionally activations) from higher-precision formats (FP32, FP16, BF16) to lower-precision formats (INT8, INT4, or even lower). This reduces memory footprint, increases memory bandwidth efficiency, and often enables faster computation through integer arithmetic.
33.3.1 Quantization Fundamentals
The intuition behind quantization is straightforward: if a weight value is 0.3847291564, do we really need all those decimal places? In practice, representing it as 0.38 (or even as an integer scaled by a factor) is often sufficient, and the reduced precision saves enormous amounts of memory and bandwidth.
Linear (uniform) quantization maps floating-point values to integers using a linear transformation:
$$q = \text{round}\left(\frac{x - z}{s}\right), \quad \hat{x} = s \cdot q + z$$
where:
- $x$ is the original floating-point value
- $q$ is the quantized integer representation
- $s$ is the scale factor (the step size between quantized levels)
- $z$ is the zero-point offset (the floating-point value that maps to the quantized value 0)
- $\hat{x}$ is the dequantized (reconstructed) value
For symmetric quantization (zero point $z = 0$), the scale is computed from the maximum absolute value:
$$s = \frac{\max(|\mathbf{x}|)}{2^{b-1} - 1}$$
where $b$ is the bit width.
Worked example. Consider quantizing the weight vector $\mathbf{x} = [-0.8, -0.2, 0.1, 0.5, 0.9]$ to INT8 (8-bit signed integers, range $[-127, 127]$) using symmetric quantization. First, compute the scale: $s = \frac{\max(|-0.8|, |-0.2|, |0.1|, |0.5|, |0.9|)}{127} = \frac{0.9}{127} \approx 0.00709$. Then quantize each value: $q = \text{round}(x / s)$, giving $[-113, -28, 14, 71, 127]$. To dequantize: $\hat{x} = s \cdot q = [-0.801, -0.198, 0.099, 0.503, 0.900]$. The maximum quantization error is $|x - \hat{x}| \leq s/2 \approx 0.0035$—negligible for most neural network operations.
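This worked example can be reproduced in a few lines of NumPy. The sketch below implements the symmetric per-tensor scheme from the formulas above; it is illustrative, not any particular library's kernel:
import numpy as np

def quantize_symmetric_int8(x: np.ndarray):
    """Symmetric linear quantization to signed INT8 (zero point fixed at 0)."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([-0.8, -0.2, 0.1, 0.5, 0.9], dtype=np.float32)
q, s = quantize_symmetric_int8(x)
x_hat = dequantize(q, s)
print(q)                           # [-113  -28   14   71  127]
print(np.max(np.abs(x - x_hat)))   # <= s/2, about 0.0035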
Per-tensor vs. per-channel vs. per-group quantization: The granularity of the scale factor determines the accuracy-efficiency tradeoff:
- Per-tensor: A single scale for the entire weight tensor. Simplest but least accurate, because outlier values in one part of the tensor force a coarse scale everywhere.
- Per-channel: A separate scale for each output channel (row of the weight matrix). Much more accurate because each channel can adapt to its own value range.
- Per-group: A separate scale for each group of $g$ consecutive weights (e.g., $g = 128$). This is the sweet spot used by GPTQ and AWQ, providing near-per-channel accuracy with manageable metadata overhead.
The storage overhead for scale factors is small: for a 4096 x 4096 weight matrix quantized per-group with $g = 128$, we store $4096 \times 4096 / 128 = 131{,}072$ FP16 scale factors (256 KB), compared to 8 MB for the INT4 weights themselves—roughly 3% overhead.
33.3.2 Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT)
There are two fundamental approaches to quantization, and understanding their tradeoffs is essential for choosing the right strategy.
Post-Training Quantization (PTQ) quantizes a pre-trained model without additional training. You take a finished FP16 model and convert it directly to lower precision. PTQ is fast (minutes to hours), requires no training data (or only a small calibration set), and preserves the original model weights. The downside is that accuracy degradation can be significant at very low bit widths (below INT4) because the model was never trained to tolerate quantization error.
Quantization-Aware Training (QAT) inserts simulated quantization operations into the training (or fine-tuning) loop. During forward passes, weights and activations are quantized and dequantized to simulate the effect of lower precision. During backward passes, gradients flow through these operations using the straight-through estimator (STE). This allows the model to learn to compensate for quantization error, producing more robust low-precision models. QAT typically achieves 0.5-1.0% better accuracy than PTQ at the same bit width, but requires a full training pipeline and significant compute.
For most LLM deployment scenarios, PTQ is preferred because it avoids the cost and complexity of retraining. QAT is reserved for cases where every fraction of a percent matters or when targeting extremely low bit widths (2-3 bits).
INT8 PTQ:
- Reduces model size by 2x (from FP16).
- Minimal accuracy loss (typically < 0.5% on benchmarks).
- Supported natively by most inference frameworks.
- Both weight-only and weight-activation quantization available.

INT4 PTQ (Weight-Only):
- Reduces model size by 4x (from FP16).
- Weights are stored as INT4, dequantized to FP16 for computation.
- Moderate accuracy loss (1-3% on benchmarks), varies by model.
- Primary benefit is reduced memory usage, enabling larger models on fewer GPUs.
An important distinction is weight-only vs. weight-and-activation quantization. Weight-only quantization stores weights in low precision but performs computation in FP16/BF16 after dequantization. Weight-and-activation quantization also quantizes the intermediate activations, enabling integer arithmetic throughout—which is faster but harder to do without accuracy loss, because activations have more dynamic range than weights.
33.3.3 GPTQ: Accurate Post-Training Quantization
GPTQ (Frantar et al., 2022) is an advanced PTQ method that minimizes quantization error by solving a layer-wise optimization problem:
$$\hat{W} = \arg\min_{\hat{W}} \|WX - \hat{W}X\|_2^2$$
where $W$ is the original weight matrix, $\hat{W}$ is the quantized weight matrix, and $X$ is a matrix of input activations collected from a small calibration dataset.
GPTQ processes weights column by column, using the Hessian of the quantization error to determine the optimal quantization order and compensate for quantization errors in subsequent columns. Key characteristics:
- Achieves near-lossless INT4 quantization for models with 1B+ parameters.
- Requires a small calibration dataset (128-256 samples).
- Quantization takes 1-4 hours depending on model size.
- Produces models compatible with fast INT4 CUDA kernels.
33.3.4 AWQ: Activation-Aware Weight Quantization
AWQ (Lin et al., 2023) observes that not all weights are equally important—some weights correspond to channels with large activation magnitudes and are critical for model performance. The intuition is elegant: if a particular input channel consistently has large activations, then the corresponding weights have an outsized effect on the output. Quantization error in these weights is amplified by the large activations, so we should protect them.
AWQ protects these "salient" weights by:
- Identifying salient channels based on activation magnitude statistics from a calibration set. Compute $\|\mathbf{X}_j\|_2$ for each input channel $j$ across calibration examples.
- Applying per-channel scaling: Multiply salient weight channels by a scale factor $s_j > 1$ before quantization, then divide the corresponding activations by $s_j$. The computation is mathematically equivalent ($\mathbf{W}\mathbf{x} = (\mathbf{W} \cdot \text{diag}(\mathbf{s}))(\text{diag}(\mathbf{s})^{-1} \cdot \mathbf{x})$), but the scaled weights have a more quantization-friendly distribution.
- Quantizing the scaled weights, which preserves precision where it matters most because the salient channels now have larger magnitudes relative to the quantization step size.
AWQ advantages over GPTQ:
- Faster quantization (minutes vs. hours) because it does not require iterative column-by-column optimization.
- Better generalization to unseen tasks (not overfit to calibration data).
- Comparable or better accuracy at INT4.
- The scale factors are computed analytically, not through iterative optimization.
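The scaling identity that AWQ relies on is easy to verify numerically. In the sketch below, the chosen channel indices and scale values are illustrative stand-ins for AWQ's activation-based saliency statistics:
import torch

torch.manual_seed(0)
out_dim, in_dim = 8, 16
W = torch.randn(out_dim, in_dim)
x = torch.randn(in_dim)

# Pretend these input channels have large activation norms (illustrative choice)
s = torch.ones(in_dim)
s[[2, 5, 11]] = 2.0

# Scale the salient weight columns up and the matching activations down
W_scaled = W * s        # equivalent to W @ diag(s): multiplies column j by s_j
x_scaled = x / s        # equivalent to diag(s)^{-1} @ x

# The product is unchanged (up to floating-point error) ...
print(torch.allclose(W @ x, W_scaled @ x_scaled, atol=1e-5))   # True

# ... but the salient columns of W_scaled are now larger relative to the
# quantization step size, so rounding error on them is proportionally smaller.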
33.3.5 GGUF: The Universal Quantization Format
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and its ecosystem. It has become the de facto standard for running quantized models on consumer hardware (CPUs, Apple Silicon, single consumer GPUs). GGUF supports a rich set of quantization types:
| Quantization | Bits/Weight | Description | Quality |
|---|---|---|---|
| Q2_K | ~2.5 | 2-bit with K-quant groups | Poor for most uses |
| Q3_K_M | ~3.4 | 3-bit with medium K-quants | Acceptable for small models |
| Q4_0 | 4.0 | Simple 4-bit, per-block | Good baseline |
| Q4_K_M | ~4.6 | 4-bit with medium K-quants | Recommended default |
| Q5_K_M | ~5.5 | 5-bit with medium K-quants | Near-lossless |
| Q6_K | ~6.6 | 6-bit K-quant | Very close to FP16 |
| Q8_0 | 8.0 | 8-bit per-block | Negligible quality loss |
The "K-quant" variants (denoted by _K) use importance-based mixed quantization: more important layers (like the first and last layers, and attention layers) receive higher precision, while less critical intermediate layers are quantized more aggressively. The _M suffix indicates medium quality (balanced), while _S is small (more aggressive) and _L is large (higher quality).
Practical recommendation. For most use cases, Q4_K_M provides the best balance of quality and size. As we discussed in Section 33.2.1, the decode phase is memory-bandwidth-bound, so reducing model size by 4x (from FP16 to Q4_K_M) translates directly to ~4x faster token generation on bandwidth-constrained hardware.
33.3.6 Quantization in Practice
# Conceptual workflow for quantizing a model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# INT8 quantization with bitsandbytes
model_8bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-70B",
quantization_config=BitsAndBytesConfig(load_in_8bit=True),
device_map="auto",
)
# INT4 quantization with bitsandbytes (NF4)
model_4bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-70B",
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
),
device_map="auto",
)
| Format | Size (70B model) | GPUs Needed | Accuracy Loss | Speed Gain |
|---|---|---|---|---|
| FP16 | 140 GB | 4x A100 80GB | Baseline | Baseline |
| INT8 | 70 GB | 2x A100 80GB | < 0.5% | ~1.5x |
| INT4 (GPTQ) | 35 GB | 1x A100 80GB | 1-2% | ~2-3x |
| INT4 (AWQ) | 35 GB | 1x A100 80GB | 1-2% | ~2-3x |
| GGUF Q4_K_M | ~40 GB | CPU or 1x GPU | 2-3% | Varies |
See Example 33.1 (code/example-01-quantization.py) for a complete quantization implementation.
33.4 Knowledge Distillation
33.4.1 The Distillation Paradigm
Knowledge distillation transfers knowledge from a large teacher model to a smaller student model. The student learns to mimic the teacher's behavior, producing a model that is much smaller and faster but retains much of the teacher's capability.
The distillation loss combines the standard task loss with a distribution-matching loss:
$$\mathcal{L} = \alpha \cdot \mathcal{L}_\text{task}(y, \hat{y}_\text{student}) + (1 - \alpha) \cdot \mathcal{L}_\text{distill}(\hat{y}_\text{teacher}, \hat{y}_\text{student})$$
where $\alpha$ balances the two losses and $\mathcal{L}_\text{distill}$ is typically the KL divergence between the teacher's and student's probability distributions over tokens, scaled by a temperature $T$:
$$\mathcal{L}_\text{distill} = T^2 \cdot \text{KL}\left(\text{softmax}\left(\frac{z_\text{teacher}}{T}\right) \| \text{softmax}\left(\frac{z_\text{student}}{T}\right)\right)$$
Higher temperatures produce softer probability distributions, revealing more of the teacher's "dark knowledge"—the relative probabilities assigned to incorrect tokens, which encode rich structural information about the task.
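The combined loss is straightforward to implement. A minimal PyTorch sketch with illustrative values for $T$ and $\alpha$; the random logits stand in for real student and teacher outputs:
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Combined task + distillation loss following the formulas above."""
    # Hard-label cross-entropy on the student's predictions
    task_loss = F.cross_entropy(student_logits, labels)
    # Temperature-scaled KL(teacher || student), scaled by T^2 to keep gradient magnitudes comparable
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    distill_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
    return alpha * task_loss + (1 - alpha) * distill_loss

# Toy example: batch of 4 positions over a 10-token vocabulary
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.tensor([1, 3, 0, 7])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()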
33.4.2 Distillation Strategies for LLMs
Output Distillation: The student learns to match the teacher's token-level probability distributions. This is the most common and straightforward approach. The key advantage is that soft probability distributions contain far more information than hard labels—the teacher's uncertainty about alternative tokens encodes structural knowledge about language.
Feature Distillation: The student learns to match intermediate representations (hidden states, attention patterns) of the teacher. Requires architectural compatibility or projection layers. Feature distillation is less common for LLMs because different model sizes have different hidden dimensions, requiring projection layers that add complexity.
Data Distillation (also called "synthetic data distillation"): The teacher generates training data (input-output pairs), and the student is trained on this synthetic data using standard supervised learning. This is the simplest approach and requires no access to the teacher's internals—only its outputs. In practice, this is the most widely used approach for distilling commercial LLMs: generate millions of high-quality examples using GPT-4 or Claude, then fine-tune an open-source model on these examples. Models like Alpaca, Vicuna, and many others were created this way.
Step-by-Step Distillation (Hsieh et al., 2023): The teacher generates chain-of-thought reasoning, and the student is trained to produce both the reasoning and the answer. This transfers the teacher's reasoning process, not just its answers. Remarkably, Hsieh et al. showed that a 770M-parameter student trained with step-by-step distillation can outperform a 540B-parameter PaLM model on certain tasks—demonstrating that reasoning capabilities can be compressed far more efficiently than raw knowledge.
Worked example. Suppose we want to distill a 70B teacher model into a 7B student for a customer support classification task. With data distillation:
# Step 1: Generate training data from the teacher
# (teacher_model.generate is shorthand for whatever text-generation API your teacher exposes)
teacher_data = []
for prompt in training_prompts:
response = teacher_model.generate(
f"Classify this customer message and explain your reasoning:\n{prompt}"
)
teacher_data.append({"input": prompt, "output": response})
# Step 2: Fine-tune the student on teacher-generated data
from transformers import Trainer, TrainingArguments
trainer = Trainer(
model=student_model,
args=TrainingArguments(
output_dir="./distilled_model",
num_train_epochs=3,
per_device_train_batch_size=8,
learning_rate=2e-5,
),
train_dataset=format_dataset(teacher_data),
)
trainer.train()
The student learns to mimic not just the teacher's classifications but its reasoning patterns, producing a model that is 10x smaller and 5-10x faster while retaining 90-95% of the teacher's accuracy on the target task.
33.4.3 Practical Considerations
- Teacher-student size ratio: A 4-10x reduction in parameters is typical. Distilling a 70B model to 7B or 13B is common.
- Data requirements: Distillation typically requires 10K-1M examples from the teacher.
- Quality filtering: Not all teacher outputs are equally good. Filtering or weighting by quality improves student performance.
- Task-specific vs. general: Task-specific distillation (distilling on data similar to the deployment task) yields better results than general distillation.
See Example 33.2 (code/example-02-distillation.py) for a distillation implementation.
33.5 Pruning
33.5.1 Types of Pruning
Pruning removes unnecessary parameters from a model to reduce size and computation:
Unstructured Pruning: Sets individual weights to zero. Achieves high sparsity (90%+) with minimal accuracy loss but requires specialized sparse computation kernels for actual speedup.
Structured Pruning: Removes entire structural units (attention heads, neurons, layers). Lower sparsity rates but produces standard dense models that benefit from existing optimized kernels.
Semi-Structured Pruning (2:4 Sparsity): NVIDIA's N:M sparsity pattern (2 out of every 4 weights are zero) is supported by hardware sparse tensor cores on Ampere and later GPUs, providing 2x speedup with minimal accuracy loss.
33.5.2 Pruning Methods
Magnitude Pruning: Remove weights with the smallest absolute values. Simple but effective, especially at moderate sparsity levels.
Wanda (Pruning by Weights and Activations): Extends magnitude pruning by considering both weight magnitude and activation magnitude. Prunes weights where $|w_{ij}| \cdot \|X_j\|_2$ is smallest, where $X_j$ is the $j$-th input feature across calibration data.
SparseGPT: An extension of the GPTQ approach to pruning. Solves a layer-wise sparse reconstruction problem, achieving high sparsity with minimal accuracy loss.
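Returning to the Wanda criterion above: it can be sketched in a few lines of PyTorch. This is a simplified illustration of the scoring rule, not the reference implementation:
import torch

def wanda_prune_mask(W: torch.Tensor, X: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Keep the weights with the largest |w_ij| * ||X_j||_2 scores within each output row.

    W: (out_features, in_features) weight matrix.
    X: (n_calibration_tokens, in_features) calibration activations.
    """
    col_norms = X.norm(p=2, dim=0)               # ||X_j||_2 for each input feature j
    scores = W.abs() * col_norms.unsqueeze(0)    # |w_ij| * ||X_j||_2
    k = int(W.shape[1] * sparsity)               # weights to prune per output row
    prune_idx = scores.topk(k, dim=1, largest=False).indices
    mask = torch.ones_like(W, dtype=torch.bool)
    rows = torch.arange(W.shape[0]).unsqueeze(1)
    mask[rows, prune_idx] = False                # zero out the lowest-scoring weights
    return mask

W = torch.randn(256, 512)                        # toy layer weights
X = torch.randn(1024, 512)                       # toy calibration activations
W_pruned = W * wanda_prune_mask(W, X, sparsity=0.5)   # 50% of each row set to zero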
33.5.3 Layer Pruning
An aggressive form of structured pruning removes entire transformer layers. Research has shown that many intermediate layers in deep transformers are surprisingly redundant:
- Removing 20-30% of layers from a 70B model can retain 90%+ of performance on many tasks.
- Layer importance can be estimated by measuring the change in hidden state across the layer: layers with small changes contribute little and can be safely removed.
The intuition behind layer importance is that if a layer barely changes the hidden state, its computation is nearly an identity function—and identity functions can be removed without consequence. We formalize this as:
$$\text{Layer Importance}(l) = \frac{\|\mathbf{h}_l - \mathbf{h}_{l-1}\|_2}{\|\mathbf{h}_{l-1}\|_2}$$
where:
- $\mathbf{h}_l$ is the hidden state after layer $l$
- $\mathbf{h}_{l-1}$ is the hidden state before layer $l$
- A small ratio indicates the layer contributes little
Worked example. Suppose a 32-layer model has layer importance scores: $[0.15, 0.22, 0.08, 0.31, \ldots, 0.04, 0.19]$. Layers with scores below 0.05 (e.g., layer 31 with score 0.04) are candidates for removal. After removing 8 such layers, we get a 24-layer model that does 25% less computation per token. We then fine-tune for 1,000 steps on a small dataset to recover any lost accuracy—this "healing" step typically recovers 50-80% of the accuracy drop from pruning.
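Given the per-layer hidden states from a forward pass (for example, collected with output_hidden_states=True in Hugging Face Transformers), the importance score above can be computed as follows; the random tensors here stand in for real hidden states:
import torch

def layer_importance(hidden_states: list[torch.Tensor]) -> list[float]:
    """Relative change in the hidden state across each layer, averaged over tokens."""
    scores = []
    for l in range(1, len(hidden_states)):
        h_prev, h_curr = hidden_states[l - 1], hidden_states[l]
        delta = (h_curr - h_prev).norm(dim=-1) / h_prev.norm(dim=-1).clamp_min(1e-6)
        scores.append(delta.mean().item())
    return scores

# Toy illustration: 5 boundaries (embedding output + 4 layers), 16 tokens, d_model = 64
hs = [torch.randn(16, 64) for _ in range(5)]
print(layer_importance(hs))   # low scores flag near-identity layers as pruning candidates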
Layer pruning is particularly effective when combined with distillation: prune the model first to get a smaller architecture, then distill from the full model into the pruned model to recover quality. This two-step process often produces better results than either technique alone.
33.6 KV Cache Optimization
33.6.1 The KV Cache Problem
As discussed in Section 33.2.2, the KV cache grows linearly with sequence length and batch size, often consuming more memory than the model weights themselves. This limits:
- Maximum sequence length: Longer sequences require more KV cache memory.
- Maximum batch size: More concurrent requests require more total KV cache memory.
- Throughput: Memory consumed by KV cache cannot be used for additional requests.
33.6.2 Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
Standard multi-head attention uses separate K and V projections for each head. MQA and GQA reduce KV cache by sharing K and V across heads.
The intuition is straightforward: in standard multi-head attention with $n$ heads, we store $n$ separate sets of K and V vectors. But do all $n$ heads really need independent keys and values? Empirically, the answer is "not entirely"—many heads attend to similar patterns, so sharing K and V across heads loses surprisingly little information while dramatically reducing memory.
- Multi-Query Attention (MQA): All heads share a single set of K and V. Reduces KV cache by $n_\text{heads}\times$. Used in Falcon, PaLM. The Q projection remains per-head, so each head can still attend differently, but they all attend over the same keys and values.
- Grouped-Query Attention (GQA): Groups of heads share K and V. With $g$ groups and $n$ heads, reduces KV cache by $n/g$ times. Used in Llama 2/3, Mistral.
Worked example. For Llama-3-70B with 64 heads and 8 KV groups (GQA with $g = 8$), the KV cache reduction is $64/8 = 8\times$ compared to full multi-head attention. At sequence length 4096 in FP16, the KV cache per request drops from 10.7 GB to approximately 1.3 GB—the difference between needing 4 GPUs and 1 GPU for a batch of 8 requests.
GQA provides a favorable tradeoff: most of the KV cache savings of MQA with minimal quality degradation compared to full multi-head attention. Models can be trained with GQA from scratch, or existing MHA models can be converted to GQA through "uptraining" (fine-tuning for a small number of additional steps with the GQA architecture).
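The GQA arithmetic from the worked example is the same KV cache formula from Section 33.2.2 with the number of KV heads in place of the number of query heads:
def kv_cache_gb(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem / 1e9

# Llama-3-70B-like shapes: 80 layers, d_head = 128, sequence length 4096, FP16
mha = kv_cache_gb(80, n_kv_heads=64, d_head=128, seq_len=4096)   # full multi-head attention
gqa = kv_cache_gb(80, n_kv_heads=8, d_head=128, seq_len=4096)    # 8 KV groups
print(mha, gqa, mha / gqa)   # ~10.7 GB vs ~1.3 GB per request, an 8x reduction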
33.6.3 KV Cache Quantization
The KV cache can be quantized to lower precision (INT8 or INT4) to reduce its memory footprint. Since the KV cache is created dynamically during inference, this is a form of activation quantization:
- INT8 KV cache: 2x memory reduction with negligible accuracy loss.
- INT4 KV cache: 4x memory reduction with small accuracy loss (model-dependent).
33.6.4 KV Cache Compression
Beyond quantization, several techniques compress the KV cache:
- Sliding window attention: Only attend to the most recent $w$ tokens. Used in Mistral (window size 4096).
- Token eviction: Evict less important tokens from the KV cache based on attention scores. H2O (Heavy-Hitter Oracle) retains tokens that receive the most attention.
- Prompt compression: Compress long prompts before inference using a learned compression model.
33.7 PagedAttention and vLLM
33.7.1 The Memory Fragmentation Problem
Traditional KV cache management pre-allocates contiguous memory for each request's maximum possible sequence length. This leads to:
- Internal fragmentation: Most requests do not use the full allocated length, wasting memory.
- External fragmentation: As requests arrive and complete, free memory becomes scattered, preventing new allocations even when total free memory is sufficient.
Research by Kwon et al. (2023) showed that existing systems waste 60-80% of KV cache memory due to fragmentation.
33.7.2 PagedAttention
PagedAttention (Kwon et al., 2023), implemented in the vLLM framework, borrows the virtual memory concept from operating systems. Instead of contiguous allocation, KV cache is managed in fixed-size pages (blocks):
- KV cache is divided into blocks of fixed size (e.g., 16 tokens per block).
- A block table maps each request's logical blocks to physical memory locations.
- Blocks are allocated on demand as the sequence grows.
- Completed blocks can be freed immediately and reused.
Benefits:
- Near-zero memory waste (fragmentation eliminated).
- 2-4x improvement in throughput over naive implementations.
- Enables efficient memory sharing for techniques like beam search and parallel sampling.
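Conceptually, the bookkeeping behind PagedAttention is a page table. The toy allocator below illustrates on-demand block allocation and immediate reuse; it is a sketch of the idea, not vLLM's implementation:
class PagedKVCacheManager:
    """Toy page-table allocator illustrating PagedAttention bookkeeping."""

    def __init__(self, num_physical_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))   # physical block ids
        self.block_tables = {}    # request_id -> list of physical block ids, in logical order
        self.seq_lens = {}        # request_id -> number of tokens cached so far

    def append_token(self, request_id: str) -> int:
        """Record one more cached token, allocating a new block only at block boundaries."""
        n = self.seq_lens.get(request_id, 0)
        if n % self.block_size == 0:                  # current logical block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must be preempted")
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.seq_lens[request_id] = n + 1
        block = self.block_tables[request_id][n // self.block_size]
        return block * self.block_size + n % self.block_size   # physical slot for this token

    def free_request(self, request_id: str):
        """Return a finished request's blocks to the free pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.seq_lens.pop(request_id, None)

mgr = PagedKVCacheManager(num_physical_blocks=4, block_size=16)
for _ in range(20):
    mgr.append_token("req-A")      # 20 tokens -> 2 blocks allocated on demand
mgr.free_request("req-A")          # blocks returned immediately for reuse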
33.7.3 vLLM Architecture
vLLM provides a complete serving solution built around PagedAttention:
Client Requests
│
▼
┌──────────────┐
│ Scheduler │ ── Continuous batching, priority queuing
│ │
├──────────────┤
│ KV Cache │ ── PagedAttention memory management
│ Manager │
├──────────────┤
│ Model │ ── Optimized transformer execution
│ Executor │
├──────────────┤
│ Worker(s) │ ── GPU workers for tensor/pipeline parallel
└──────────────┘
Key features:
- Continuous batching: New requests join the batch immediately without waiting for existing requests to complete.
- Prefix caching: System prompts shared across requests are cached and reused. This is particularly impactful for agent systems (as we discussed in Chapter 32) where every request shares a long system prompt with tool definitions.
- Tensor parallelism: Model split across multiple GPUs for large models.
- Speculative decoding integration: Built-in support for draft-model speculation.
- LoRA serving: Multiple LoRA adapters can be served simultaneously on the same base model, enabling multi-tenant deployments where each customer has a fine-tuned version.
- OpenAI-compatible API: Drop-in replacement for the OpenAI API, making migration straightforward.
33.7.4 Memory Budget Calculation for vLLM
Understanding how vLLM allocates GPU memory is essential for capacity planning. The total GPU memory is partitioned as:
$$\text{GPU Memory} = \text{Model Weights} + \text{KV Cache} + \text{Activations} + \text{CUDA Overhead}$$
The gpu_memory_utilization parameter (default 0.90) controls what fraction of GPU memory vLLM may use. Of this usable memory, model weights and CUDA overhead are fixed costs; the remaining memory is allocated to KV cache, which directly determines the maximum concurrent requests.
Worked example. Consider serving Llama-3-8B (FP16, ~16 GB) on an A100 80 GB:
- Total usable: $80 \times 0.90 = 72$ GB
- Model weights: 16 GB
- CUDA overhead: ~2 GB
- Available for KV cache: $72 - 16 - 2 = 54$ GB
With GQA (8 KV groups), KV cache per token: $2 \times 32 \times 8 \times 128 \times 2 = 131$ KB. At 4096 tokens per request: $131 \times 4096 / 1024 = 524$ MB per request. Maximum concurrent requests: $54{,}000 / 524 \approx 103$ concurrent requests. This calculation shows why PagedAttention's elimination of fragmentation is so valuable—without it, only 20-40 requests would fit.
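This capacity estimate is worth automating when planning deployments. A back-of-envelope helper following the calculation above; all inputs are the approximate figures used in the example:
def max_concurrent_requests(gpu_mem_gb: float, util: float, weights_gb: float,
                            overhead_gb: float, n_layers: int, n_kv_heads: int,
                            d_head: int, max_seq_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate how many full-length requests fit in the KV cache budget."""
    kv_budget_gb = gpu_mem_gb * util - weights_gb - overhead_gb
    kv_per_request_gb = 2 * n_layers * n_kv_heads * d_head * max_seq_len * bytes_per_elem / 1e9
    return int(kv_budget_gb / kv_per_request_gb)

# Llama-3-8B-like shapes (32 layers, 8 KV heads, d_head = 128) on an A100 80 GB
print(max_concurrent_requests(80, 0.90, weights_gb=16, overhead_gb=2,
                              n_layers=32, n_kv_heads=8, d_head=128, max_seq_len=4096))
# ~100 concurrent 4096-token requests, in line with the estimate above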
33.8 Speculative Decoding
33.8.1 The Idea
Auto-regressive generation is inherently sequential: each token depends on all previous tokens. This makes it impossible to parallelize the generation itself. Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) circumvents this by:
- A small, fast draft model generates $K$ candidate tokens quickly.
- The large target model verifies all $K$ tokens in a single parallel forward pass.
- Accept the longest prefix of draft tokens that the target model agrees with.
- Sample one additional token from the target model's distribution for the first rejected position.
This produces outputs that are statistically identical to the target model (no approximation) while achieving speedups of 2-3x, because the target model processes $K$ tokens in one pass instead of $K$ separate passes.
33.8.2 The Verification Algorithm
The mathematical beauty of speculative decoding lies in its verification algorithm, which guarantees exact equivalence with the target model. Let us walk through it carefully.
For each draft token position $i$ (where $i = 1, \ldots, K$), the draft model has generated token $x_i$ with probability $p(x_i \mid x_{<i})$, while the target model assigns it probability $q(x_i \mid x_{<i})$. The draft token is accepted with probability $\min\left(1, \frac{q(x_i \mid x_{<i})}{p(x_i \mid x_{<i})}\right)$: if the target model likes the token at least as much as the draft model did, it is always accepted; otherwise it is accepted with probability equal to the ratio of the two probabilities.
If a token at position $j$ is rejected (and all tokens at positions $1, \ldots, j-1$ were accepted), we discard tokens $j, j+1, \ldots, K$ and sample one new token from the residual distribution:
$$p_\text{residual}(x) = \text{normalize}\left(\max(0, q(x) - p(x))\right)$$
where:
- $q(x)$ is the target model's probability for token $x$
- $p(x)$ is the draft model's probability for token $x$
- The $\max(0, \cdot)$ ensures we only sample from tokens that the target model likes more than the draft model
- The normalization ensures this is a valid probability distribution
Worked example. Suppose the draft model generates tokens ["The", "cat", "sat"] with probabilities $p = [0.8, 0.6, 0.4]$, and the target model assigns probabilities $q = [0.9, 0.3, 0.5]$.
- Token "The": $q(0.9) \geq p(0.8)$ -- Accept.
- Token "cat": $q(0.3) < p(0.6)$ -- Accept with probability $0.3/0.6 = 0.5$. Suppose we sample $u = 0.7 > 0.5$ -- Reject.
- Token "sat" and beyond: Discarded.
We accepted 1 token ("The") from the draft, then sample position 2 from the residual distribution $\max(0, q(x) - p(x))$ normalized over the vocabulary. This might sample "dog" instead of "cat" if the target model preferred "dog" more strongly.
The total number of tokens generated is: (number accepted) + 1 (from residual or from the target's distribution at the first rejected position). In the best case, all $K$ tokens are accepted and we also get one bonus token, yielding $K + 1$ tokens from a single target model pass. In the worst case (all rejected), we get exactly 1 token—the same as standard decoding.
This acceptance-rejection scheme guarantees that the final output distribution is exactly the target model's distribution, regardless of the draft model's quality. This is not an approximation—it is a mathematically proven equivalence.
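The acceptance rule and residual distribution can be sketched directly from the formulas above. The probabilities passed to the verifier are those of the drafted tokens only; the residual distribution is computed over the full vocabulary at the first rejected position:
import numpy as np

def verify_draft(p_draft: np.ndarray, q_target: np.ndarray, rng=np.random) -> int:
    """Return how many of the K draft tokens are accepted.

    p_draft[i]: draft model's probability of the i-th drafted token.
    q_target[i]: target model's probability of the same token.
    """
    for i in range(len(p_draft)):
        accept_prob = min(1.0, q_target[i] / p_draft[i])
        if rng.random() > accept_prob:
            return i          # reject position i; resample it from the residual distribution
    return len(p_draft)       # all drafts accepted (plus one bonus token from the target)

def residual_distribution(q: np.ndarray, p: np.ndarray) -> np.ndarray:
    """normalize(max(0, q - p)) over the vocabulary, used at the first rejected position."""
    r = np.maximum(0.0, q - p)
    return r / r.sum()

# The worked example: accept/reject based on per-token probabilities
print(verify_draft(np.array([0.8, 0.6, 0.4]), np.array([0.9, 0.3, 0.5])))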
33.8.3 Draft Model Selection
The ideal draft model:
- Is much faster than the target model (4-10x fewer parameters).
- Has high acceptance rate (agrees with the target model frequently).
- Shares the same vocabulary as the target model.

Common choices:
- A smaller model from the same family (e.g., Llama-3-8B drafting for Llama-3-70B).
- A quantized version of the target model.
- An n-gram model or small trained head.
33.8.4 Speedup Analysis
The expected speedup from speculative decoding depends on the acceptance rate $\alpha$ (the probability that a draft token is accepted) and the number of draft tokens $K$:
$$\text{Expected tokens per step} = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$
For $\alpha = 0.8$ and $K = 5$: expected tokens = $(1 - 0.8^6) / (1 - 0.8) = (1 - 0.262) / 0.2 = 3.69$ tokens per target model pass, compared to 1 token without speculation—a 3.7x improvement in token generation rate. However, the wall-clock speedup is lower because the draft model takes time too. If each draft pass costs one-fifth of a target pass (the draft model is 5x faster), one speculation round costs roughly $1 + 5 \times 0.2 = 2$ target-pass equivalents, giving an actual speedup of approximately $3.69 / 2 \approx 1.85\times$.
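These two quantities are easy to compute when comparing candidate draft models. A small helper using the simplified cost model above, which ignores verification overlap and scheduler overhead:
def expected_tokens_per_step(alpha: float, K: int) -> float:
    """Expected tokens generated per target-model pass (Leviathan et al., 2023)."""
    return (1 - alpha ** (K + 1)) / (1 - alpha)

def wall_clock_speedup(alpha: float, K: int, draft_cost: float) -> float:
    """Speedup vs. standard decoding; draft_cost is one draft pass relative to one target pass."""
    return expected_tokens_per_step(alpha, K) / (1 + K * draft_cost)

print(expected_tokens_per_step(0.8, 5))            # ~3.69 tokens per target pass
print(wall_clock_speedup(0.8, 5, draft_cost=0.2))  # ~1.85x when the draft model is 5x faster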
In practice, speculative decoding achieves 1.5-3x speedup depending on the task (code generation has higher acceptance rates than creative writing because code is more predictable).
33.8.5 Self-Speculative Decoding
An alternative that avoids needing a separate draft model: use layer skipping within the target model itself. The model generates draft tokens using only a subset of its layers (e.g., every other layer or the first half of layers), then verifies using all layers. This eliminates the need for a separate model and guarantees vocabulary compatibility, at the cost of a lower acceptance rate compared to a dedicated draft model.
Medusa (Cai et al., 2024) takes another approach: it adds multiple lightweight "heads" to the model that predict tokens 2, 3, ..., $K$ positions ahead. These heads are trained while keeping the base model frozen, and they generate draft tokens in parallel—no sequential draft generation needed. This makes Medusa particularly latency-friendly for the draft phase.
33.9 Batching Strategies
33.9.1 Static Batching
The simplest approach: collect multiple requests into a fixed batch and process them together. All requests in the batch must wait for the longest request to complete before any can return results.
Pros: Simple implementation, good GPU utilization during prefill. Cons: Head-of-line blocking (fast requests wait for slow ones), poor utilization during decode.
33.9.2 Continuous Batching (In-Flight Batching)
Requests are dynamically added to and removed from the batch:
- New requests join the batch immediately (during the next generation step).
- Completed requests leave the batch immediately.
- No head-of-line blocking.
This is the standard in production serving frameworks (vLLM, TensorRT-LLM, TGI).
33.9.3 Prefill-Decode Disaggregation
Since prefill is compute-bound and decode is memory-bandwidth-bound, some architectures use separate hardware for each. This insight—that the two phases have fundamentally different computational profiles—is one of the most important architectural observations in LLM serving.
- Prefill cluster: GPUs optimized for high compute throughput (e.g., H100 with high FLOPS) process prompt encoding. These GPUs are fully utilized because prefill can saturate their compute capacity.
- Decode cluster: GPUs optimized for high memory bandwidth generate tokens. These GPUs benefit from lower-precision quantization that reduces memory bandwidth requirements.
The prefill cluster processes the prompt and transfers the resulting KV cache to the decode cluster, which then generates tokens. This prevents long prefills from blocking decode operations and allows each cluster to be independently scaled.
When to use disaggregation. This architecture is most beneficial when there is high variance in prompt lengths. If some requests have 100-token prompts and others have 10,000-token prompts, the long prefills would cause head-of-line blocking in a unified system. With disaggregation, the decode cluster maintains consistent latency regardless of prompt lengths arriving at the prefill cluster.
Companies like Anyscale and DeepSeek have deployed disaggregated architectures in production, reporting 30-50% improvements in decode latency consistency compared to unified serving.
33.9.4 Chunked Prefill
Long prompts can block the entire batch during prefill. Chunked prefill splits long prompts into smaller chunks, interleaving prefill chunks with decode steps:
Step 1: Prefill chunk 1 of request A + Decode step for requests B, C
Step 2: Prefill chunk 2 of request A + Decode step for requests B, C
Step 3: Decode step for requests A, B, C
This maintains consistent decode latency for existing requests even when new long-context requests arrive.
33.10 Compilation and Kernel Optimization
33.10.1 TensorRT-LLM
NVIDIA's TensorRT-LLM compiles transformer models into optimized CUDA execution plans. It represents the "maximum performance" option for NVIDIA GPUs, achieving the highest throughput and lowest latency at the cost of more complex deployment.
Key optimizations:
- Operator fusion: Combines multiple operations (e.g., linear + bias + activation) into single kernels. This reduces memory roundtrips—instead of writing intermediate results to HBM and reading them back, fused kernels keep data in registers or shared memory.
- Quantization kernels: Optimized INT8 and INT4 GEMM kernels using tensor cores. TensorRT-LLM provides the fastest INT4 inference through custom CUDA kernels that are hand-tuned for each GPU architecture.
- Custom attention kernels: Flash attention, paged attention, and other optimized attention implementations.
- Pipeline and tensor parallelism: Built-in multi-GPU support.
- In-flight batching: TensorRT-LLM's name for continuous batching, with additional optimizations for request scheduling.
- FP8 support on Hopper: The H100 GPU supports FP8 (8-bit floating point) natively, and TensorRT-LLM provides FP8 quantization that is faster than INT8 with comparable accuracy.
The tradeoff with TensorRT-LLM is complexity: models must be explicitly compiled for each target GPU architecture, the build process can take hours for large models, and debugging is more difficult than with Python-based frameworks. For teams with NVIDIA hardware expertise, the performance gains (10-30% over vLLM) often justify the investment.
33.10.2 ONNX Runtime
The Open Neural Network Exchange (ONNX) provides a portable model format with runtime optimizations:
- Graph optimization: Constant folding, operator fusion, dead code elimination.
- Hardware-specific backends: CUDA, TensorRT, DirectML, OpenVINO.
- Quantization-aware optimization: INT8 inference with calibration.
33.10.3 torch.compile
PyTorch's torch.compile applies graph-level optimizations to PyTorch models:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
model = torch.compile(model, mode="reduce-overhead")
Optimizations include operator fusion, memory planning, and CUDA graph capture. Particularly effective for reducing the overhead of many small kernel launches during auto-regressive generation.
33.10.4 Flash Attention
Flash Attention (Dao et al., 2022; Dao, 2023) is one of the most impactful single optimizations in modern deep learning. To understand why, consider the standard attention computation:
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
The naive implementation computes $\mathbf{Q}\mathbf{K}^T$ (an $N \times N$ matrix where $N$ is the sequence length), stores it in GPU HBM (high-bandwidth memory), applies softmax, stores the result, then multiplies by $\mathbf{V}$. For $N = 8192$ in FP16, the attention matrix alone requires $8192^2 \times 2 = 128$ MB. Each read and write to HBM takes orders of magnitude longer than the actual arithmetic.
Flash Attention exploits the GPU memory hierarchy by:
- Tiling the computation into blocks that fit in SRAM (on-chip memory, ~20 MB on an A100). Each block computes a partial attention output that is later combined.
- Fusing the entire attention operation ($\mathbf{Q}\mathbf{K}^T$, scaling, softmax, $\times \mathbf{V}$) into a single CUDA kernel. This avoids writing intermediate results to HBM entirely.
- Online softmax: Uses the online softmax algorithm to compute exact softmax across tiles without needing to see all values at once. This is the key mathematical trick—softmax normally requires a global maximum, but the online variant tracks a running maximum and correction factor.
- Avoiding materializing the full $N \times N$ attention matrix in GPU memory.
Results:
- 2-4x speedup for attention computation.
- Memory usage scales as $O(N)$ with sequence length instead of $O(N^2)$.
- Standard in all modern inference frameworks (vLLM, TensorRT-LLM, TGI).
- Flash Attention 2 further improves parallelism across sequence length and attention heads.
- Flash Attention 3 (2024) targets the Hopper architecture (H100) with warp-specialization and FP8 support.
The impact of Flash Attention extends beyond inference to training: it enables training on much longer sequences (64K+ tokens) without running out of memory, which is essential for long-context models. As we discussed in Chapter 31, long-context LLMs are important for RAG systems that stuff many retrieved documents into the prompt.
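In PyTorch, fused attention is exposed through torch.nn.functional.scaled_dot_product_attention, which dispatches to a FlashAttention-style kernel on supported GPUs. A minimal sketch—the shapes are illustrative, and on CPU it falls back to a reference implementation with the same semantics:
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, n_heads, seq_len, d_head): the layout expected by scaled_dot_product_attention
q = torch.randn(1, 16, 2048, 64, device=device, dtype=dtype)
k = torch.randn(1, 16, 2048, 64, device=device, dtype=dtype)
v = torch.randn(1, 16, 2048, 64, device=device, dtype=dtype)

# Fused attention: the N x N attention matrix is never materialized in HBM
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 16, 2048, 64])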
33.11 Serving Frameworks
33.11.1 Framework Comparison
The LLM serving landscape has matured rapidly. Each framework occupies a distinct niche:
| Framework | Developer | Key Feature | Best For |
|---|---|---|---|
| vLLM | UC Berkeley | PagedAttention | High-throughput serving |
| TensorRT-LLM | NVIDIA | NVIDIA GPU optimization | Maximum performance on NVIDIA hardware |
| TGI (Text Generation Inference) | Hugging Face | HF ecosystem integration | HF model deployment |
| SGLang | Stanford | RadixAttention, structured generation | Complex prompting patterns |
| Ollama | Ollama | Easy local deployment | Local/edge deployment |
| llama.cpp | ggerganov | CPU/Apple Silicon inference | Consumer hardware |
| Triton Inference Server | NVIDIA | Multi-model, multi-framework | Enterprise model management |
Triton Inference Server deserves special attention for enterprise deployments. Unlike the other frameworks that focus on serving a single model type, Triton manages multiple models simultaneously, supports model versioning, implements model pipelines (where the output of one model feeds into another), and provides built-in metrics for monitoring. It supports TensorRT, ONNX, PyTorch, and TensorFlow backends, making it a universal serving platform. The tradeoff is complexity—Triton's configuration system has a steep learning curve.
SGLang introduces RadixAttention, which extends prefix caching to a radix tree structure. When many requests share common prefixes (e.g., system prompts, few-shot examples), SGLang caches the KV cache for these prefixes in a tree structure and reuses them across requests. For applications with structured prompt patterns, this can provide 2-5x throughput improvements over standard prefix caching. SGLang also excels at constrained generation (e.g., generating valid JSON) through its frontend DSL.
33.11.2 vLLM Deep Dive
vLLM has become the de facto standard for LLM serving due to its combination of PagedAttention, continuous batching, and broad model support:
from vllm import LLM, SamplingParams
# Load model with vLLM
llm = LLM(
model="meta-llama/Llama-3-8B-Instruct",
tensor_parallel_size=1,
gpu_memory_utilization=0.90,
max_model_len=8192,
)
# Define sampling parameters
params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=256,
)
# Generate (vLLM handles batching automatically)
outputs = llm.generate(["What is deep learning?"], params)
For production serving, vLLM provides an OpenAI-compatible API server:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.90
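Because the server speaks the OpenAI protocol, any OpenAI client can talk to it. A minimal sketch using the official openai Python package, assuming the server above is running on its default port (8000); the api_key value is arbitrary unless the server was started with an API key:
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)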
33.11.3 llama.cpp and GGUF
For deployment on consumer hardware (CPUs, Apple Silicon, single consumer GPUs), llama.cpp provides highly optimized C/C++ inference:
- GGUF format: Efficient model file format with built-in quantization (Q4_0, Q4_K_M, Q5_K_M, Q8_0, etc.).
- CPU optimization: AVX2, AVX-512, ARM NEON optimizations.
- Metal/CUDA support: GPU acceleration on Apple Silicon and NVIDIA GPUs.
- Memory mapping: Models are memory-mapped for efficient loading.
33.11.4 Choosing a Framework
Decision criteria:
1. Hardware: NVIDIA GPUs → vLLM or TensorRT-LLM. Apple Silicon → llama.cpp. Mixed → ONNX Runtime.
2. Scale: Single request → llama.cpp. Thousands of concurrent requests → vLLM.
3. Model support: HuggingFace models → TGI or vLLM. Custom models → TensorRT-LLM.
4. Optimization priority: Latency → TensorRT-LLM. Throughput → vLLM. Ease of use → TGI or Ollama.
See Example 33.3 (code/example-03-model-serving.py) for a model serving implementation.
33.12 Latency vs. Throughput Optimization
33.12.1 The Fundamental Tradeoff
Latency and throughput are inherently in tension:
- Latency (time per request): Minimized by dedicating all resources to a single request.
- Throughput (requests per second): Maximized by batching many requests together to amortize GPU overhead.
Increasing batch size improves throughput but increases latency for individual requests. The optimal operating point depends on the application:
| Application | Priority | Target |
|---|---|---|
| Chatbot (streaming) | Latency | TTFT < 500ms, 30+ TPS |
| Batch document processing | Throughput | Max requests/sec per GPU |
| Code completion (inline) | Latency | TTFT < 200ms |
| Content moderation | Throughput | Cost per 1K requests |
| Real-time translation | Both | TTFT < 300ms, 50+ TPS |
33.12.2 Latency Optimization Techniques
Latency optimization focuses on making each individual request as fast as possible:
- Model quantization: Reduces memory reads per token, directly accelerating decode. INT4 quantization can provide ~3x latency improvement over FP16 (as we discussed in Section 33.3).
- Speculative decoding: Generates multiple tokens per target model pass, achieving 2-3x speedup (Section 33.8).
- Tensor parallelism: Splits model across GPUs to parallelize computation within a single request. Reduces latency proportionally to the number of GPUs (with some communication overhead).
- KV cache optimization: GQA reduces cache size, faster cache access reduces memory pressure.
- torch.compile / CUDA graphs: Reduce kernel launch overhead. CUDA graphs capture a sequence of kernel launches and replay them with minimal CPU overhead, which is particularly effective for the repetitive decode phase.
- Smaller models via distillation: A well-distilled 7B model can match a 70B model on specific tasks with 10x lower latency (Section 33.4).
Estimating latency. A useful rule of thumb for decode latency:
$$\text{Time per token} \approx \frac{\text{Model size (bytes)}}{\text{Memory bandwidth (bytes/sec)}}$$
For a 70B model in INT4 (~35 GB) on an H100 (3.35 TB/s bandwidth): $35 / 3350 \approx 10.4$ ms per token, or ~96 tokens per second. This is a theoretical upper bound—real-world performance includes overhead from KV cache reads, kernel launches, and other factors, typically yielding 60-80% of theoretical throughput.
33.12.3 Throughput Optimization Techniques
Throughput optimization focuses on maximizing the total number of requests handled per unit time:
- Continuous batching: Maximize GPU utilization by always having work ready. This alone can provide 3-10x throughput improvement over static batching (Section 33.9.2).
- PagedAttention: Reduce memory waste to fit more concurrent requests (Section 33.7).
- Pipeline parallelism: Process different micro-batches at different pipeline stages simultaneously. While tensor parallelism reduces latency for a single request, pipeline parallelism increases throughput by overlapping computation.
- Prefix caching: Avoid redundant computation for shared system prompts. If every request shares a 1000-token system prompt, caching its KV values saves 1000 prefill tokens per request.
- Request queuing and scheduling: Prioritize short requests to reduce average wait time. Shortest-job-first scheduling minimizes average latency at the cost of potentially starving long requests.
33.13 Cost Analysis
33.13.1 Cost Model
The total cost of model serving involves:
$$\text{Total Cost} = \text{Hardware Cost} + \text{Infrastructure Cost} + \text{Operations Cost}$$
Hardware Cost: GPU instances are the dominant expense.
| GPU | Memory | Cost (cloud, per hour) | Tokens/sec (70B, INT4) |
|---|---|---|---|
| A100 40GB | 40 GB | ~$3.50 | ~30 |
| A100 80GB | 80 GB | ~$5.00 | ~35 |
| H100 80GB | 80 GB | ~$8.00 | ~80 |
| L40S | 48 GB | ~$2.50 | ~20 |
| A10G | 24 GB | ~$1.50 | ~10 (7B model) |
Cost per token:
$$\text{Cost per 1M tokens} = \frac{\text{GPU cost/hour}}{(\text{Tokens/sec}) \times 3600} \times 10^6$$
For an H100 serving a 70B INT4 model at 80 tokens/sec:
$$\text{Cost per 1M tokens} = \frac{\$8.00}{80 \times 3600} \times 10^6 = \$27.78$$
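The same cost-per-token formula can be applied across the GPU table above. A small helper; the prices and throughputs are the approximate figures from the table, not guarantees:
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_sec: float) -> float:
    """Serving cost per 1M generated tokens at full utilization."""
    return gpu_cost_per_hour / (tokens_per_sec * 3600) * 1e6

# Approximate figures from the table above (70B INT4 where applicable)
for gpu, cost, tps in [("H100 80GB", 8.00, 80), ("A100 80GB", 5.00, 35), ("L40S", 2.50, 20)]:
    print(f"{gpu}: ${cost_per_million_tokens(cost, tps):.2f} per 1M tokens")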
33.13.2 Optimization Impact on Cost
| Optimization | Throughput Impact | Cost Reduction |
|---|---|---|
| INT8 quantization | 1.5-2x | 33-50% |
| INT4 quantization | 2-3x | 50-67% |
| vLLM (vs. naive serving) | 2-4x | 50-75% |
| Continuous batching | 3-10x | 67-90% |
| Speculative decoding | 1.5-2.5x | 33-60% |
| Combined optimizations | 10-24x | 90-96% |
33.13.3 Cost Optimization Strategies
Beyond the per-technique optimizations discussed throughout this chapter, several system-level strategies can dramatically reduce costs:
Semantic caching: Cache the results of semantically similar queries. Unlike exact-match caching, semantic caching uses embedding similarity to identify queries that are close enough to return the same answer. For FAQ-style applications, this can eliminate 30-60% of LLM calls.
Model cascading: Route simple queries to a small, cheap model and only escalate to a large model when the small model is uncertain. A 7B model might handle 70% of queries at 1/10th the cost of a 70B model. The key is a reliable confidence estimator.
Prompt optimization: Shorter prompts mean fewer input tokens and lower cost. Techniques include:
- Remove redundant instructions from system prompts.
- Use abbreviations or shorthand in few-shot examples.
- Truncate retrieved context to only the most relevant passages.
- Use structured prompts instead of verbose natural language.
Off-peak scheduling: For batch workloads, schedule inference during off-peak hours when spot GPU instances are cheapest. Cloud providers often offer 60-70% discounts for interruptible instances.
Right-sizing models: Not every task requires a 70B model. Systematically benchmark your specific tasks across model sizes to find the smallest model that meets your quality threshold. Many production systems use 7B-13B models after task-specific fine-tuning.
33.13.4 Build vs. Buy Analysis
Self-hosting vs. API providers:
Self-hosting advantages:
- Lower cost at high volume (break-even typically at 50M+ tokens/day).
- Full control over model, data, and infrastructure.
- Customization (fine-tuned models, custom quantization).
- No data sent to third parties.

API provider advantages:
- Zero infrastructure management.
- Automatic scaling.
- Access to the latest models without deployment effort.
- Lower cost at low-to-moderate volume.
Break-even analysis: $$\text{Self-host when: } \frac{\text{Daily API Cost}}{\text{GPU Cost/Day}} > 1$$
For a single H100 at $192/day, self-hosting is economical when API costs would exceed approximately $200/day, which corresponds to roughly 7-10M tokens/day at typical API pricing.
33.14 Production Deployment Patterns
33.14.1 Deployment Architectures
Single-Model Serving:
Load Balancer → [vLLM Instance 1, vLLM Instance 2, ...] → GPU Cluster
Multi-Model Serving:
Router → Model A (high priority, latency-optimized)
→ Model B (batch processing, throughput-optimized)
→ Model C (lightweight, edge deployment)
Model Cascade:
Request → Small Model → Confidence Check → High Confidence → Return
→ Low Confidence → Large Model → Return
A cascade first tries a small, cheap model. If the small model is confident in its answer, return immediately. Otherwise, escalate to a larger, more expensive model. This can reduce average cost by 50-80% if the small model handles the majority of requests.
33.14.2 Auto-Scaling
Key metrics for auto-scaling decisions:
- GPU utilization: Scale up when sustained > 80%.
- Queue depth: Scale up when requests are waiting.
- Latency P95: Scale up when 95th percentile latency exceeds SLA.
- Request rate: Predictive scaling based on historical traffic patterns.

Scale-down considerations:
- GPU instances are expensive even when idle.
- Model loading time (30s-5min) means instances cannot scale up instantly.
- Keep warm instances for baseline traffic; use burst instances for peaks.
33.14.3 Health Monitoring
Critical metrics to monitor:
- TTFT (Time to First Token): End-to-end latency from request to first generated token.
- TPS (Tokens per Second): Generation speed per request.
- Throughput: Total tokens generated per second across all requests.
- GPU Memory: KV cache utilization, fragmentation.
- Queue Depth: Number of requests waiting for processing.
- Error Rate: Failed requests, timeouts, OOM errors.
33.15 Putting It All Together: An Optimization Workflow
Given the many optimization techniques in this chapter, a practitioner may wonder: where should I start? Here is a systematic workflow for optimizing LLM inference for a specific deployment scenario.
Step 1: Profile and Identify the Bottleneck
Before optimizing, measure:
- Is the workload compute-bound (long prefills, high batch sizes) or memory-bandwidth-bound (single-request decode)?
- What is the current TTFT and TPS?
- What is the GPU utilization (compute and memory bandwidth)?
- What is the KV cache utilization?
Use tools like nvidia-smi, PyTorch Profiler, or vLLM's built-in metrics to gather this data.
Step 2: Apply Low-Effort, High-Impact Optimizations First
- Switch to a serving framework (vLLM, TGI) if you are using naive HuggingFace generate(). This alone provides 3-5x throughput improvement from continuous batching and PagedAttention.
- Enable Flash Attention (usually automatic in modern frameworks).
- Apply INT8 quantization if memory is a constraint. Near-zero accuracy loss with 2x memory savings.
Step 3: Apply Medium-Effort Optimizations
- INT4 quantization (AWQ or GPTQ) if you need to fit the model on fewer GPUs or increase batch size.
- Speculative decoding if latency is critical and you have access to a smaller model from the same family.
- Prefix caching if requests share common system prompts.
Step 4: Apply High-Effort Optimizations
- Knowledge distillation to create a task-specific smaller model.
- TensorRT-LLM compilation for maximum throughput on NVIDIA hardware.
- Model cascading to route simple requests to a cheaper model.
- Custom CUDA kernels for domain-specific operations (rarely needed).
Step 5: Validate and Monitor
After each optimization, validate that accuracy has not degraded beyond acceptable thresholds. Run your evaluation suite (as discussed in Chapter 31 for RAG systems and Chapter 32 for agents) and confirm that quality metrics remain within bounds. In production, continuously monitor latency percentiles, throughput, and error rates.
33.16 Summary
Inference optimization is the bridge between research models and production applications. Without the techniques in this chapter, the powerful models we built in earlier parts of this book would remain impractical luxuries—too slow, too expensive, and too memory-hungry for real-world deployment. The key techniques covered in this chapter are:
- Quantization (INT8, INT4, GPTQ, AWQ): Reduces model size and memory bandwidth, enabling larger models on fewer GPUs with 2-4x speedups.
- Knowledge distillation: Creates smaller, faster models that retain most of the teacher's capability, offering 4-10x size reductions.
- Pruning: Removes unnecessary parameters, with structured pruning (2:4 sparsity) offering hardware-accelerated 2x speedup.
- KV cache optimization: GQA, quantization, and compression reduce the memory bottleneck that limits batch size and sequence length.
- PagedAttention/vLLM: Eliminates memory fragmentation, enabling 2-4x throughput improvement.
- Speculative decoding: Generates multiple tokens per target model pass with no quality loss, achieving 2-3x speedup.
- Batching strategies: Continuous batching and chunked prefill maximize GPU utilization.
- Compilation (TensorRT, Flash Attention, torch.compile): Low-level optimizations that reduce per-operation overhead.
These techniques are composable: combining quantization, PagedAttention, continuous batching, and speculative decoding can yield 10-24x cost reduction compared to naive serving.
The field continues to evolve rapidly, with new techniques emerging regularly. The engineering fundamentals remain constant: understand the bottleneck (compute vs. memory bandwidth), measure before optimizing, and choose optimizations that align with your application's latency and throughput requirements.
References
- Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers." ICLR 2023. https://arxiv.org/abs/2210.17323
- Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. (2023). "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024. https://arxiv.org/abs/2306.00978
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. https://arxiv.org/abs/2309.06180
- Leviathan, Y., Kalman, M., and Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. https://arxiv.org/abs/2211.17192
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv preprint arXiv:2302.01318.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. https://arxiv.org/abs/2205.14135
- Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." ICLR 2024. https://arxiv.org/abs/2307.08691
- Hsieh, C.-Y., Li, C.-L., Yeh, C.-K., Nakhost, H., Fujii, Y., Ratner, A., Krishna, R., Lee, C.-Y., and Pfister, T. (2023). "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes." ACL 2023.
- Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS 2023. https://arxiv.org/abs/2305.14314
- Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." NIPS 2014 Deep Learning Workshop.