Chapter 33 Quiz: Inference Optimization and Model Serving


Instructions

Choose the best answer for each question. Each question has exactly one correct answer unless otherwise specified.


Question 1

What is the primary bottleneck during the auto-regressive decode phase of transformer inference with batch size 1?

A) Compute (FLOPs) capacity of the GPU
B) Memory bandwidth — reading model weights from GPU memory
C) Network latency between CPU and GPU
D) Tokenization speed

Answer: B

Explanation: During decode with batch size 1, each token generation requires reading the entire model weights from GPU memory. The compute required per token is small relative to the data transfer. With a 70B FP16 model (140 GB) on an H100 (3.35 TB/s bandwidth), reading the weights alone takes ~42ms, making memory bandwidth the binding constraint.
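The ~42ms figure follows directly from dividing bytes by bandwidth. A minimal sketch of the arithmetic in Python (the 140 GB and 3.35 TB/s figures are the ones quoted above):

```python
# Lower bound on per-token decode latency: every FP16 weight must stream through HBM once.
def decode_latency_lower_bound_ms(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return model_bytes / bandwidth_bytes_per_s * 1000

weight_bytes = 70e9 * 2       # 70B parameters x 2 bytes (FP16) ~= 140 GB
h100_bandwidth = 3.35e12      # ~3.35 TB/s HBM bandwidth
print(f"{decode_latency_lower_bound_ms(weight_bytes, h100_bandwidth):.1f} ms/token")  # ~41.8 ms
```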


Question 2

What is the key difference between the prefill and decode phases of transformer inference?

A) Prefill uses attention while decode does not
B) Prefill is memory-bandwidth-bound while decode is compute-bound
C) Prefill processes all input tokens in parallel (compute-bound) while decode generates tokens sequentially (memory-bandwidth-bound)
D) Prefill only runs once during model loading

Answer: C

Explanation: Prefill processes the entire input prompt in parallel with high arithmetic intensity, making it compute-bound. Decode generates one token at a time with low arithmetic intensity (each step reads all weights but performs relatively few operations), making it memory-bandwidth-bound.


Question 3

In symmetric linear quantization, what does the scale factor $s$ represent?

A) The number of bits used for quantization
B) The mapping ratio between floating-point range and integer range
C) The bias added to all quantized values
D) The learning rate for quantization-aware training

Answer: B

Explanation: The scale factor $s = \max(|x|) / (2^{b-1} - 1)$ maps the floating-point value range to the integer range. It determines how much each integer step represents in the original floating-point domain. Dividing by $s$ maps float to int; multiplying by $s$ maps back.
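A minimal NumPy sketch of symmetric per-tensor quantization using this scale definition (the bit width and random input are illustrative):

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 8):
    """Quantize with s = max(|x|) / (2^(b-1) - 1); dividing by s maps float -> int."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Multiplying by s maps int -> float."""
    return q.astype(np.float32) * scale

x = np.random.randn(8).astype(np.float32)
q, s = quantize_symmetric(x, bits=8)
print(np.abs(x - dequantize(q, s)).max())  # round-off error, at most s/2 per element
```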


Question 4

What is the primary innovation of GPTQ compared to naive round-to-nearest quantization?

A) GPTQ uses floating-point quantization instead of integer
B) GPTQ quantizes columns sequentially, using the Hessian to compensate for errors in subsequent columns
C) GPTQ only quantizes attention layers, leaving MLPs in full precision
D) GPTQ requires the model to be retrained after quantization

Answer: B

Explanation: GPTQ processes weight columns sequentially, using the Hessian matrix (computed from calibration data activations) to determine the optimal quantization order and compensate for quantization errors introduced in one column by adjusting remaining columns. This layer-wise optimization minimizes the output error $\|WX - \hat{W}X\|$.


Question 5

How does AWQ (Activation-Aware Weight Quantization) differ from GPTQ?

A) AWQ uses a higher bit width than GPTQ
B) AWQ protects "salient" weight channels (those with large activation magnitudes) by applying per-channel scaling before quantization
C) AWQ is a training-time technique while GPTQ is post-training
D) AWQ only works with INT8, not INT4

Answer: B

Explanation: AWQ observes that some weight channels correspond to large activations and are disproportionately important for model quality. Rather than quantizing all channels equally, AWQ applies per-channel scaling to protect these salient channels, achieving comparable or better accuracy than GPTQ with faster quantization time.


Question 6

In knowledge distillation, what is the purpose of the temperature parameter $T$?

A) To control the learning rate during student training
B) To soften the teacher's probability distribution, revealing more information about relative token probabilities
C) To determine the batch size for training
D) To set the maximum sequence length

Answer: B

Explanation: Higher temperature $T$ produces softer probability distributions from the teacher model, making the probabilities of non-top tokens larger and more informative. This "dark knowledge" — the relative probabilities of incorrect tokens — encodes rich information about the task structure that helps the student learn beyond just the top prediction.
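A minimal NumPy sketch showing how temperature reshapes the teacher distribution (the logit values are made up for illustration; a full distillation loss would additionally compare student and teacher distributions, typically scaling the soft-target term by $T^2$):

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, T: float) -> np.ndarray:
    """Higher T flattens the distribution, exposing relative probabilities of non-top tokens."""
    z = logits / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([6.0, 2.5, 2.0, -1.0])        # made-up logits for a 4-token vocabulary
print(softmax_with_temperature(teacher_logits, T=1.0))  # sharply peaked on the top token
print(softmax_with_temperature(teacher_logits, T=4.0))  # softer: non-top probabilities become visible
```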


Question 7

What is the advantage of 2:4 (N:M) structured sparsity over unstructured sparsity?

A) 2:4 sparsity achieves higher compression ratios
B) 2:4 sparsity is supported by GPU hardware (sparse tensor cores), enabling actual speedup without specialized sparse kernels
C) 2:4 sparsity has zero accuracy loss
D) 2:4 sparsity can be applied to attention mechanisms but not feed-forward layers

Answer: B

Explanation: The 2:4 sparsity pattern (2 zeros out of every 4 elements) is natively supported by NVIDIA's sparse tensor cores (Ampere and later), providing approximately 2x speedup using standard hardware. Unstructured sparsity requires specialized sparse computation kernels and typically does not translate to real speedup on current hardware.
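A minimal NumPy sketch of imposing the 2:4 pattern by magnitude (real pipelines pair this with fine-tuning and store the result in the hardware's compressed sparse format; the random matrix is illustrative):

```python
import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude entries in every consecutive group of 4 (last axis)."""
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # indices of the 2 smallest |w| per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.random.randn(2, 8)
print(prune_2_of_4(w))   # exactly two zeros in every group of four
```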


Question 8

What is the KV cache, and why does it grow with sequence length?

A) A cache of tokenizer vocabularies that grows as more words are encountered
B) A cache of key and value tensors from previous tokens that grows linearly as each new token adds its own K and V entries
C) A cache of gradient checkpoints used for backpropagation
D) A cache of model weights that grows as more parameters are loaded

Answer: B

Explanation: During auto-regressive generation, the model stores the key (K) and value (V) tensors from all previously processed tokens to avoid recomputing them. Each new token adds its own K and V entries across all layers, so the cache grows linearly with sequence length. For large models, this cache can consume tens of GB per request.
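A minimal sketch of the per-request cache size, using the standard accounting of 2 tensors (K and V) × layers × KV heads × head dimension × sequence length × bytes per element (plain Python; the 70B-class configuration below is a hypothetical example):

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: K and V stored per token, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config: 80 layers, 64 KV heads of dimension 128, FP16 cache.
gb = kv_cache_bytes(seq_len=8192, n_layers=80, n_kv_heads=64, head_dim=128) / 1e9
print(f"{gb:.1f} GB for one 8K-token request")   # ~21.5 GB with full multi-head KV
```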


Question 9

How does Grouped-Query Attention (GQA) reduce KV cache memory compared to standard Multi-Head Attention (MHA)?

A) By using smaller head dimensions
B) By sharing K and V projections across groups of attention heads, reducing the number of unique K/V vectors stored
C) By eliminating the V projection entirely
D) By compressing the cache using LZ4 compression

Answer: B

Explanation: In standard MHA, each attention head has its own K and V projections, requiring $n_\text{heads}$ sets of K/V vectors. GQA groups heads together, with each group sharing a single set of K/V projections. With $g$ groups and $n$ heads, the KV cache is reduced by a factor of $n/g$, providing significant memory savings with minimal quality loss.
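Reusing the kv_cache_bytes sketch from Question 8, the saving is just the $n/g$ factor applied to the number of KV heads (the 64-head / 8-KV-head grouping is an illustrative assumption):

```python
# Same hypothetical config, but 64 query heads sharing 8 KV heads (n/g = 64/8 = 8).
mha_gb = kv_cache_bytes(seq_len=8192, n_layers=80, n_kv_heads=64, head_dim=128) / 1e9
gqa_gb = kv_cache_bytes(seq_len=8192, n_layers=80, n_kv_heads=8,  head_dim=128) / 1e9
print(f"MHA: {mha_gb:.1f} GB   GQA: {gqa_gb:.1f} GB   ({mha_gb / gqa_gb:.0f}x smaller)")
```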


Question 10

What problem does PagedAttention (vLLM) solve?

A) The slow speed of attention computation
B) Memory fragmentation in KV cache management that wastes 60-80% of allocated memory
C) The quadratic memory cost of the attention matrix
D) The need for multiple GPUs

Answer: B

Explanation: Traditional KV cache management pre-allocates contiguous memory for the maximum possible sequence length of each request. This leads to internal fragmentation (most requests don't use the full allocation) and external fragmentation (free memory becomes scattered). PagedAttention manages KV cache in fixed-size blocks, eliminating fragmentation and enabling near-zero memory waste.
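A minimal sketch of the fragmentation arithmetic (plain Python; the 4096-token reservation, 600-token request, and 16-token block size are illustrative assumptions):

```python
import math

def contiguous_waste(actual_len: int, reserved_len: int) -> float:
    """Fraction of a max-length contiguous reservation left unused by the request."""
    return 1 - actual_len / reserved_len

def paged_waste(actual_len: int, block_size: int = 16) -> float:
    """With fixed-size blocks, only the unfilled tail of the last block is wasted."""
    allocated = math.ceil(actual_len / block_size) * block_size
    return 1 - actual_len / allocated

# A 600-token conversation in a slot reserved for 4096 tokens:
print(f"contiguous: {contiguous_waste(600, 4096):.0%} wasted")       # ~85%
print(f"paged:      {paged_waste(600, block_size=16):.1%} wasted")   # ~1.3%
```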


Question 11

In speculative decoding, what guarantees that the output distribution is identical to the target model's distribution?

A) The draft model is trained on the same data as the target model
B) The acceptance-rejection scheme with residual sampling ensures statistical equivalence
C) The draft model uses the same number of parameters
D) Both models share the same random seed

Answer: B

Explanation: Speculative decoding uses a mathematically rigorous acceptance-rejection scheme: a token $x$ proposed by the draft model is accepted with probability $\min(1, p_\text{target}(x) / p_\text{draft}(x))$, and on rejection a replacement is sampled from the normalized residual distribution $\propto \max(0, p_\text{target}(x) - p_\text{draft}(x))$. This guarantees that the final output distribution is exactly the target model's distribution, regardless of draft model quality.
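A minimal NumPy sketch of the verification step for a single position, with $p_\text{target}$ and $p_\text{draft}$ as explicit probability vectors (the three-token vocabulary and both distributions are made up; the Monte Carlo check illustrates that the outputs follow the target distribution):

```python
import numpy as np

def accept_or_resample(p_target: np.ndarray, p_draft: np.ndarray,
                       draft_token: int, rng: np.random.Generator) -> int:
    """Verify one draft token; the returned token is distributed exactly as p_target."""
    if rng.random() < min(1.0, p_target[draft_token] / p_draft[draft_token]):
        return int(draft_token)                       # accepted
    residual = np.maximum(p_target - p_draft, 0.0)    # rejected: resample from the residual
    return int(rng.choice(len(p_target), p=residual / residual.sum()))

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target distribution (made up)
q = np.array([0.3, 0.5, 0.2])   # draft distribution (made up)
samples = [accept_or_resample(p, q, rng.choice(3, p=q), rng) for _ in range(100_000)]
print(np.bincount(samples) / len(samples))            # empirically ~[0.6, 0.3, 0.1]
```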


Question 12

What is the primary advantage of continuous batching over static batching?

A) Continuous batching uses less GPU memory
B) Continuous batching eliminates head-of-line blocking by dynamically adding and removing requests from the batch
C) Continuous batching always produces higher quality outputs
D) Continuous batching requires fewer GPUs

Answer: B

Explanation: In static batching, all requests in a batch must wait for the longest request to complete. Continuous batching dynamically adds new requests and removes completed ones at each generation step, eliminating head-of-line blocking. This significantly improves both throughput and latency, especially when request lengths vary.


Question 13

Why is chunked prefill important for production serving?

A) It reduces model size
B) It prevents long prompt prefills from blocking decode operations for in-flight requests, maintaining consistent generation latency
C) It improves the quality of generated text
D) It eliminates the need for KV cache

Answer: B

Explanation: When a new request with a very long prompt arrives, processing the entire prompt in one step would block decode operations for all in-flight requests, causing latency spikes. Chunked prefill splits the prompt into smaller chunks and interleaves prefill chunks with decode steps, maintaining consistent generation latency for existing requests.
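A minimal sketch of the chunking itself (plain Python; the 512-token chunk size is an illustrative assumption, and the scheduling of decode steps between chunks is left as a comment):

```python
def prefill_chunks(prompt_token_ids: list[int], chunk_size: int = 512):
    """Yield the prompt in fixed-size chunks so decode steps can run between them."""
    for start in range(0, len(prompt_token_ids), chunk_size):
        yield prompt_token_ids[start:start + chunk_size]

long_prompt = list(range(2000))                     # stand-in for 2,000 prompt token ids
for chunk in prefill_chunks(long_prompt):
    pass  # run one prefill step on `chunk`, then give in-flight requests a decode step
print(sum(1 for _ in prefill_chunks(long_prompt)))  # 4 chunks of at most 512 tokens
```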


Question 14

What does Flash Attention optimize compared to standard attention?

A) The number of attention heads
B) Memory I/O by tiling the computation to fit in GPU SRAM and fusing the entire attention operation into a single kernel
C) The dimensionality of query and key vectors
D) The vocabulary size of the model

Answer: B

Explanation: Flash Attention optimizes the memory access pattern of attention computation. Instead of materializing the full N x N attention matrix in GPU HBM (slow global memory), it tiles the computation to fit in SRAM (fast on-chip memory) and fuses the softmax, masking, and value multiplication into a single kernel. This reduces HBM reads/writes and provides 2-4x speedup.


Question 15

For a model serving application with a latency SLA of TTFT < 500ms and TPS > 30, which optimization should be prioritized?

A) Maximum throughput batch processing
B) Latency-focused optimizations: quantization, speculative decoding, and tensor parallelism
C) Model retraining with more data
D) Increasing the model size for better quality

Answer: B

Explanation: The SLA requires low latency (fast first token) and fast generation speed. Quantization reduces the data transfer per token (improving TPS), speculative decoding generates multiple tokens per pass (improving TPS), and tensor parallelism reduces per-token computation time (improving TTFT). Throughput-focused batching would increase latency.


Question 16

What is the approximate cost per 1M tokens for an H100 serving a 70B INT4 model at 80 tokens/sec?

A) $0.28
B) $2.78
C) $27.78
D) $277.80

Answer: C

Explanation: At $8/hour for the H100, generating 80 tokens/sec yields 288,000 tokens/hour. Cost per 1M tokens = $8.00 / 288,000 * 1,000,000 = $27.78. This illustrates why inference optimization (increasing tokens/sec) directly reduces cost per token.
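A minimal sketch of the cost arithmetic (plain Python; the $8/hour rate and 80 tokens/sec throughput are the figures from the question):

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per 1M generated tokens for a single GPU at steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

print(f"${cost_per_million_tokens(8.00, 80):.2f}")   # $27.78
```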


Question 17

In a model cascade architecture, when should a request be escalated from the small model to the large model?

A) Always, to ensure maximum quality
B) When the small model's output confidence (e.g., max probability) falls below a threshold
C) When the request contains more than 100 tokens
D) Randomly, to balance load between models

Answer: B

Explanation: A model cascade uses the small model's confidence score as a routing signal. High-confidence responses are returned directly (fast and cheap). Low-confidence responses are escalated to the larger model for higher quality. The confidence threshold controls the accuracy-cost tradeoff: a higher threshold escalates more requests (higher cost, higher quality), while a lower threshold keeps more traffic on the small model (lower cost).
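A minimal sketch of the routing logic (Python; small_model and large_model are placeholder callables assumed to return an answer plus a confidence score, and the 0.8 threshold is an illustrative assumption):

```python
def route(prompt: str, small_model, large_model, confidence_threshold: float = 0.8) -> str:
    """Return the small model's answer when it is confident; otherwise escalate."""
    answer, confidence = small_model(prompt)   # e.g. confidence = mean max token probability
    if confidence >= confidence_threshold:
        return answer                          # fast, cheap path
    return large_model(prompt)                 # low confidence: escalate to the large model
```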


Question 18

What is the primary benefit of prefix caching in a serving framework?

A) It reduces model loading time
B) It avoids redundant computation of shared system prompts across multiple requests
C) It compresses the model weights
D) It reduces the output token count

Answer: B

Explanation: When many requests share the same system prompt (common in API deployments), prefix caching stores the KV cache for the shared prefix and reuses it across requests. This eliminates redundant prefill computation, reducing TTFT and compute cost proportionally to the length of the shared prefix.


Question 19

Which metric is most important for a streaming chatbot user experience?

A) Total throughput (requests per second across all users)
B) Time to first token (TTFT) and tokens per second (TPS) for the individual request
C) Model perplexity on a benchmark dataset
D) GPU memory utilization

Answer: B

Explanation: For a streaming chatbot, the user directly experiences TTFT (how quickly the response starts) and TPS (how fast tokens stream in). TTFT should be under 500ms and TPS should be 30+ for a smooth experience. Total system throughput matters for infrastructure economics but does not directly affect individual user experience.


Question 20

When does self-hosting become more cost-effective than using API providers?

A) Always, regardless of volume
B) When daily API costs exceed the daily cost of GPU infrastructure (typically at 50M+ tokens/day)
C) Only for models with more than 100B parameters
D) When using CPU-only infrastructure

Answer: B

Explanation: Self-hosting has high fixed costs (GPU rental, infrastructure, engineering) but low marginal costs per token. API providers have low fixed costs but higher per-token prices. The break-even point typically occurs around 50M+ tokens/day, where the marginal savings of self-hosting overcome the fixed costs.
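A minimal sketch of the break-even arithmetic (plain Python; the $10 per 1M API price and $500/day infrastructure budget are illustrative assumptions, and marginal self-hosting costs per token are ignored):

```python
def break_even_tokens_per_day(api_price_per_million: float, daily_infra_cost: float) -> float:
    """Daily volume above which a fixed self-hosting cost beats per-token API pricing."""
    return daily_infra_cost / api_price_per_million * 1_000_000

# Illustrative figures: $10 per 1M tokens via API vs. a $500/day GPU + operations budget.
print(f"{break_even_tokens_per_day(10.0, 500.0):,.0f} tokens/day")   # 50,000,000
```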


Question 21

What is the effect of INT4 weight-only quantization on prefill vs. decode speed?

A) Equal speedup for both prefill and decode
B) Large speedup for decode (memory-bandwidth-bound) but modest speedup for prefill (compute-bound)
C) Large speedup for prefill but no effect on decode
D) No speedup for either, only memory savings

Answer: B

Explanation: INT4 weight-only quantization reduces the bytes per weight by 4x, significantly reducing memory bandwidth requirements during decode (which is memory-bandwidth-bound). During prefill, the workload is compute-bound—the bottleneck is FLOPs, not memory transfers—so reducing weight size provides only modest speedup from reduced data transfer.


Question 22

What does "arithmetic intensity" measure in the context of inference optimization?

A) The total number of arithmetic operations in the model
B) The ratio of FLOPs to bytes transferred, determining whether a workload is compute-bound or memory-bandwidth-bound
C) The accuracy of quantized arithmetic operations
D) The clock speed of the GPU's arithmetic units

Answer: B

Explanation: Arithmetic intensity = FLOPs / bytes transferred. A high ratio means the workload performs many operations per byte loaded (compute-bound). A low ratio means the workload loads many bytes per operation (memory-bandwidth-bound). This metric is fundamental for choosing appropriate optimization strategies.
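A minimal sketch comparing decode and prefill arithmetic intensity against a GPU's balance point (plain Python; the 2 FLOPs-per-parameter-per-token approximation and the rough H100 peak figures are assumptions for illustration):

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved

params = 70e9
weight_bytes = params * 2                        # FP16 weights read once per forward pass
decode = arithmetic_intensity(2 * params, weight_bytes)          # batch 1 decode: ~1 FLOP/byte
prefill = arithmetic_intensity(2 * params * 2048, weight_bytes)  # 2K-token prompt: ~2048 FLOPs/byte

machine_balance = 989e12 / 3.35e12               # rough H100 dense FP16 FLOPs / bandwidth, ~295
print(decode < machine_balance < prefill)        # True: decode bandwidth-bound, prefill compute-bound
```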


Question 23

Which serving framework uses PagedAttention as its core innovation?

A) TensorRT-LLM
B) vLLM
C) ONNX Runtime
D) llama.cpp

Answer: B

Explanation: vLLM (from UC Berkeley) introduced PagedAttention as its core innovation for efficient KV cache memory management. While other frameworks have since adopted similar techniques, PagedAttention originated in and is most closely associated with vLLM.


Question 24

What is the optimal draft model for speculative decoding?

A) The largest available model
B) A model that is much faster than the target model (4-10x fewer parameters) but has high agreement with the target model's outputs
C) Any model with the same architecture as the target
D) A model trained on a different dataset for diversity

Answer: B

Explanation: The ideal draft model balances speed (must be significantly faster than the target to provide a net speedup) with quality (must have high acceptance rate, meaning it frequently agrees with the target model's predictions). The draft model must also share the same vocabulary. Common choices include smaller models from the same family or quantized versions of the target.


Question 25

What combination of techniques would achieve the highest cost reduction for a high-throughput serving deployment?

A) FP32 precision with static batching
B) INT4 quantization + PagedAttention + continuous batching + prefix caching
C) Larger model size with single-request inference
D) CPU-only inference with maximum parallelism

Answer: B

Explanation: The combination of INT4 quantization (2-3x speedup from reduced memory bandwidth), PagedAttention (near-zero memory waste enabling higher batch sizes), continuous batching (maximized GPU utilization), and prefix caching (avoided redundant computation) can yield 10-24x cost reduction compared to a naive deployment. These techniques are composable and address different bottlenecks.