Chapter 33 Exercises: Inference Optimization and Model Serving
Section 1: Quantization
Exercise 33.1: Quantization Fundamentals
Implement symmetric linear quantization and dequantization functions for INT8 and INT4. Given a random FP32 tensor of shape (512, 512), quantize it to INT8 and INT4, then dequantize back. Compute the mean absolute error (MAE) and max absolute error between the original and dequantized tensors. Plot the error distribution as a histogram.
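A minimal sketch of the quantize/dequantize pair, assuming a symmetric per-tensor scale of max(|x|) / (2^(b-1) - 1) and simulating INT4 by clamping (PyTorch has no native 4-bit integer tensor type):

```python
import torch

def quantize_symmetric(x: torch.Tensor, bits: int):
    """Symmetric linear quantization: q = clamp(round(x / scale)) to the signed b-bit range."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    scale = x.abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q * scale

x = torch.randn(512, 512)
for bits in (8, 4):
    q, scale = quantize_symmetric(x, bits)
    err = (x - dequantize(q, scale)).abs()
    print(f"INT{bits}: MAE={err.mean():.5f}  max abs error={err.max():.5f}")
```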
Exercise 33.2: Per-Channel vs. Per-Tensor Quantization
Compare per-tensor and per-channel INT8 quantization on a weight matrix from a pre-trained model (use any small HuggingFace model's linear layer weights). Measure the quantization error (MSE) for each approach. Explain why per-channel quantization is more accurate and visualize the scale factor distribution across channels.
Exercise 33.3: Quantization-Aware Outlier Analysis
Load a small transformer model and compute the distribution of weight values for each layer. Identify layers with outlier weights (values > 3 standard deviations from the mean). Analyze how these outliers affect INT8 quantization error. Implement a mixed-precision strategy that uses FP16 for layers with many outliers and INT8 for the rest.
Exercise 33.4: GPTQ Simulation
Implement a simplified version of the GPTQ algorithm for a single linear layer. Given a weight matrix W (shape $d_{\text{out}} \times d_{\text{in}}$) and a calibration input matrix X whose columns are samples (shape $d_{\text{in}} \times n$), implement: (a) column-wise quantization, (b) Hessian computation ($H = 2XX^T$), (c) error compensation using the Hessian inverse. Compare the output error ($\|WX - \hat{W}X\|$) against naive round-to-nearest quantization.
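One possible starting point, assuming X holds calibration inputs as columns so that the Hessian is $H = 2XX^T$ plus a small damping term; this is a loose, unblocked sketch of the error-compensation idea rather than the full GPTQ algorithm:

```python
import torch

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Quantize W column by column, spreading each column's error onto later columns via H^-1."""
    d_out, d_in = W.shape
    H = 2.0 * X @ X.T                                          # (d_in, d_in)
    H += damp * torch.diag(H).mean() * torch.eye(d_in)         # damping keeps H invertible
    Hinv = torch.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().max() / qmax                               # single per-tensor scale for simplicity
    Q = W.clone()
    for j in range(d_in):
        q_col = torch.clamp(torch.round(Q[:, j] / scale), -qmax - 1, qmax) * scale
        err = (Q[:, j] - q_col) / Hinv[j, j]
        Q[:, j] = q_col
        if j + 1 < d_in:
            Q[:, j + 1:] -= torch.outer(err, Hinv[j, j + 1:])  # compensate not-yet-quantized columns
    return Q

torch.manual_seed(0)
W, X = torch.randn(128, 256), torch.randn(256, 512)            # 512 calibration samples as columns
W_rtn = torch.clamp(torch.round(W / (W.abs().max() / 7)), -8, 7) * (W.abs().max() / 7)
W_gptq = gptq_like_quantize(W, X)
print("RTN  output error:", torch.norm(W @ X - W_rtn @ X).item())
print("GPTQ output error:", torch.norm(W @ X - W_gptq @ X).item())
```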
Exercise 33.5: Quantization Format Comparison
Quantize a small language model (e.g., GPT-2 or a 125M parameter model) using three methods: (a) naive round-to-nearest INT8, (b) bitsandbytes NF4, and (c) GPTQ INT4. Evaluate perplexity on a test dataset (WikiText-2 or C4) for each. Create a table comparing model size, perplexity, and inference speed.
Section 2: Knowledge Distillation and Pruning
Exercise 33.6: Basic Knowledge Distillation
Implement knowledge distillation for a text classification task. Use a pre-trained BERT-base as the teacher and a 2-layer transformer as the student. Train the student with: (a) standard cross-entropy loss only, (b) KL divergence distillation loss only, (c) combined loss with alpha=0.5. Compare accuracy for each approach. Experiment with temperatures T=1, 2, 5, 10.
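A minimal sketch of the combined loss for part (c), assuming the standard Hinton-style formulation where the KL term is scaled by $T^2$ so its gradient magnitude stays comparable to the hard-label term:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * temperature-scaled KL(teacher || student) + (1 - alpha) * cross-entropy on hard labels."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# toy usage: batch of 4 examples, 3 classes
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels, T=5.0, alpha=0.5))
```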
Exercise 33.7: Data Distillation
Use a large language model (via an API or a local deployment) to generate a synthetic training dataset of 1,000 question-answer pairs for a specific domain (e.g., science trivia). Fine-tune a small model on this synthetic dataset and evaluate its performance compared to a model trained on the same number of human-written examples. Analyze the quality characteristics of synthetic vs. human data.
Exercise 33.8: Magnitude Pruning
Implement iterative magnitude pruning on a pre-trained small model. Start at 0% sparsity and incrementally prune to 30%, 50%, 70%, and 90% sparsity. At each level, measure: (a) model accuracy on a validation set, (b) actual model size (with sparse storage), (c) inference latency. Plot the accuracy-sparsity tradeoff curve.
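A minimal sketch of one global magnitude-pruning step using an explicit mask (an illustrative choice; torch.nn.utils.prune provides the same functionality with mask bookkeeping built in):

```python
import torch

@torch.no_grad()
def magnitude_prune(model: torch.nn.Module, sparsity: float) -> None:
    """Zero out the globally smallest-magnitude fraction of all linear-layer weights."""
    weights = [m.weight for m in model.modules() if isinstance(m, torch.nn.Linear)]
    all_mags = torch.cat([w.abs().flatten() for w in weights])
    threshold = torch.quantile(all_mags, sparsity)
    for w in weights:
        w.mul_((w.abs() > threshold).float())      # keep only weights above the magnitude threshold

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
for target in (0.3, 0.5, 0.7, 0.9):
    magnitude_prune(model, target)
    weights = [m.weight for m in model.modules() if isinstance(m, torch.nn.Linear)]
    actual = sum((w == 0).sum().item() for w in weights) / sum(w.numel() for w in weights)
    print(f"target {target:.0%} -> actual sparsity {actual:.1%}")
```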
Exercise 33.9: Structured Pruning (Attention Heads)
Analyze the importance of attention heads in a pre-trained transformer using the "head importance" metric (gradient-based or Taylor expansion). Prune the least important 25% and 50% of heads. Evaluate the model before and after pruning on a downstream task. Which layers' heads are most prunable?
Exercise 33.10: 2:4 Sparsity Simulation
Implement the N:M sparsity pattern (2:4) for a weight matrix: for every group of 4 consecutive weights, zero out the two with the smallest magnitudes. Measure the approximation error compared to the original matrix. Apply this to each linear layer in a small model and evaluate the accuracy impact. Compare to unstructured 50% sparsity.
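A minimal sketch of the 2:4 masking step, assuming the pattern is applied along the last dimension and that the number of elements is divisible by 4:

```python
import torch

def sparsify_2_4(w: torch.Tensor) -> torch.Tensor:
    """In each group of 4 consecutive weights (last dim), keep the 2 largest magnitudes, zero the rest."""
    groups = w.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices                     # indices of the 2 largest per group
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
    return (groups * mask).reshape(w.shape)

w = torch.randn(512, 512)
w_24 = sparsify_2_4(w)
print("sparsity:", (w_24 == 0).float().mean().item())               # ~0.50
print("relative error:", (torch.norm(w - w_24) / torch.norm(w)).item())
```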
Section 3: KV Cache and Attention Optimization
Exercise 33.11: KV Cache Memory Calculator
Write a function that calculates the KV cache memory requirements given model parameters (num_layers, num_kv_heads, head_dim, dtype) and inference parameters (batch_size, max_seq_len). Compute requirements for Llama-3-8B, Llama-3-70B, and Mixtral-8x7B at batch sizes 1, 8, 32, and 128. Generate a table and identify the configuration limits for A100 80GB and H100 80GB GPUs.
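A minimal sketch of the calculation, assuming FP16 (2 bytes per element) and a factor of 2 for keys plus values; note that for GQA models such as Llama-3 and Mixtral the relevant count is the number of KV heads, not query heads (the Llama-3-8B figures below come from its public config: 32 layers, 8 KV heads, head_dim 128):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   batch_size: int, max_seq_len: int, dtype_bytes: int = 2) -> int:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes/element."""
    return 2 * num_layers * num_kv_heads * head_dim * batch_size * max_seq_len * dtype_bytes

for bs in (1, 8, 32, 128):
    gib = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                         batch_size=bs, max_seq_len=8192) / 2**30
    print(f"Llama-3-8B, 8k context, batch {bs:>3}: {gib:6.1f} GiB")
```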
Exercise 33.12: KV Cache Quantization
Implement INT8 quantization for KV cache tensors. Generate synthetic KV cache tensors matching the shape of a real model's cache. Compare: (a) FP16 cache, (b) INT8 per-tensor quantized cache, (c) INT8 per-head quantized cache. Measure memory savings and reconstruction error. Visualize the error distribution per head.
Exercise 33.13: Sliding Window Attention
Implement a sliding window attention mechanism where each token attends to only the most recent W tokens (plus any "anchor" tokens like the beginning-of-sequence token). Compare memory usage and output quality against full attention for a sequence of 2048 tokens with window sizes W=128, 256, 512, and 1024.
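A minimal mask-based simulation of the mechanism (a real implementation would also drop evicted entries from the KV cache; the head dimension of 128 in the example is an arbitrary choice):

```python
import torch

def sliding_window_mask(seq_len: int, window: int, num_anchors: int = 1) -> torch.Tensor:
    """Boolean mask (True = may attend): causal, and either within the window or an anchor column."""
    i = torch.arange(seq_len).unsqueeze(1)       # query positions
    j = torch.arange(seq_len).unsqueeze(0)       # key positions
    causal = j <= i
    in_window = (i - j) < window
    anchor = j < num_anchors                     # e.g. the beginning-of-sequence token
    return causal & (in_window | anchor)

mask = sliding_window_mask(2048, window=256)
scores = torch.randn(2048, 2048) / 128 ** 0.5
attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
print("average attended keys per query:", mask.float().sum(-1).mean().item())
```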
Exercise 33.14: Multi-Query vs. Grouped-Query Attention
Implement three attention variants from scratch in PyTorch: (a) Multi-Head Attention (MHA), (b) Multi-Query Attention (MQA), and (c) Grouped-Query Attention (GQA) with 8 groups. Compare KV cache size, forward pass latency, and output similarity (cosine similarity) for each variant with 32 attention heads and head dimension 128.
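A minimal sketch of the shared computation: GQA expands each KV head to serve several query heads, with MQA as the 1-group case and MHA as the case where groups equal query heads (batch size 1, 512 tokens, and FP16 cache-size accounting are assumptions for the example):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_groups):
    """q: (B, Hq, T, D); k, v: (B, num_groups, T, D). Each KV head serves Hq // num_groups query heads."""
    B, Hq, T, D = q.shape
    k = k.repeat_interleave(Hq // num_groups, dim=1)      # broadcast KV heads up to Hq
    v = v.repeat_interleave(Hq // num_groups, dim=1)
    scores = q @ k.transpose(-2, -1) / D ** 0.5
    return F.softmax(scores, dim=-1) @ v

B, Hq, T, D = 1, 32, 512, 128
q = torch.randn(B, Hq, T, D)
for groups in (32, 8, 1):                                 # MHA, GQA-8, MQA
    k, v = torch.randn(B, groups, T, D), torch.randn(B, groups, T, D)
    out = grouped_query_attention(q, k, v, groups)
    kv_mib = 2 * groups * T * D * 2 / 2**20               # K+V, FP16, per layer
    print(f"groups={groups:>2}: output {tuple(out.shape)}, KV cache {kv_mib:.1f} MiB per layer")
```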
Exercise 33.15: Token Eviction Strategy
Implement the H2O (Heavy-Hitter Oracle) token eviction strategy: maintain a fixed-size KV cache by evicting tokens with the lowest accumulated attention scores. Test on a model processing a 4096-token input with a cache budget of 1024 tokens. Compare the output quality (perplexity or next-token accuracy) against full cache and random eviction baselines.
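One possible sketch of the eviction rule, assuming the heavy-hitter score is the accumulated attention mass each cached token has received and that a small window of the most recent tokens is always kept (both are assumptions consistent with the high-level H2O description):

```python
import torch

def h2o_keep_indices(accumulated_attn: torch.Tensor, budget: int, num_recent: int = 64) -> torch.Tensor:
    """Return indices of cached tokens to keep: the most recent ones plus the highest-scoring older ones."""
    seq_len = accumulated_attn.shape[0]
    if seq_len <= budget:
        return torch.arange(seq_len)
    recent = torch.arange(seq_len - num_recent, seq_len)
    heavy = accumulated_attn[: seq_len - num_recent].topk(budget - num_recent).indices
    return torch.cat([heavy.sort().values, recent])

# toy example: accumulated attention mass over a 4096-token cache, budget of 1024 slots
scores = torch.rand(4096)
keep = h2o_keep_indices(scores, budget=1024)
print("tokens kept:", keep.numel())
```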
Section 4: Speculative Decoding and Batching
Exercise 33.16: Speculative Decoding Simulation
Implement the speculative decoding verification algorithm. Simulate a draft model and target model using probability distributions over a vocabulary of 100 tokens. For each candidate token, implement the acceptance-rejection scheme. Measure acceptance rate and effective speedup for draft models with varying quality levels (80%, 85%, 90%, 95% agreement with target).
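A minimal sketch of the verification rule from the speculative sampling literature: accept a drafted token x with probability min(1, p_target(x) / p_draft(x)), and on rejection resample from the normalized residual max(p_target - p_draft, 0):

```python
import torch

def verify_token(p_target: torch.Tensor, p_draft: torch.Tensor, token: int):
    """Accept/reject one drafted token; on rejection, resample from the residual distribution."""
    accept_prob = min(1.0, (p_target[token] / p_draft[token]).item())
    if torch.rand(1).item() < accept_prob:
        return True, token
    residual = torch.clamp(p_target - p_draft, min=0)
    residual = residual / residual.sum()
    return False, torch.multinomial(residual, 1).item()

# toy distributions over a 100-token vocabulary; the mixture weight controls draft "quality"
vocab = 100
p_target = torch.softmax(torch.randn(vocab), dim=0)
p_draft = 0.9 * p_target + 0.1 * torch.softmax(torch.randn(vocab), dim=0)
accepted = sum(verify_token(p_target, p_draft, torch.multinomial(p_draft, 1).item())[0]
               for _ in range(10_000))
print("empirical acceptance rate:", accepted / 10_000)
```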
Exercise 33.17: Draft Model Analysis
Compare three draft model strategies for a given target model: (a) a smaller model from the same family, (b) a quantized version of the target model, and (c) an n-gram model trained on the same data. For each, measure acceptance rate and wall-clock speedup. Analyze under what conditions each strategy is optimal.
Exercise 33.18: Continuous Batching Simulator
Build a discrete-event simulator for a continuous batching inference server. Model: (a) Poisson request arrivals with configurable rate, (b) variable prompt lengths (normal distribution, mean=500, std=200 tokens), (c) variable output lengths (normal distribution, mean=200, std=100 tokens). Compare throughput and latency distributions for static batching (batch sizes 1, 4, 8, 16) vs. continuous batching.
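A possible starting point for the workload generator (parameter names are illustrative; lengths are clipped to a minimum of 1 token because the normal distribution can produce non-positive samples):

```python
import numpy as np

def generate_requests(rate_per_s: float, duration_s: float, seed: int = 0) -> list[dict]:
    """Poisson arrivals (exponential inter-arrival times) with normally distributed lengths."""
    rng = np.random.default_rng(seed)
    t, requests = 0.0, []
    while True:
        t += rng.exponential(1.0 / rate_per_s)
        if t >= duration_s:
            break
        requests.append({
            "arrival_s": t,
            "prompt_len": max(1, int(rng.normal(500, 200))),
            "output_len": max(1, int(rng.normal(200, 100))),
        })
    return requests

reqs = generate_requests(rate_per_s=2.0, duration_s=60.0)
print(len(reqs), "requests; first:", reqs[0])
```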
Exercise 33.19: Chunked Prefill Analysis
Extend the simulator from Exercise 33.18 to support chunked prefill. When a new request with a long prompt (>1024 tokens) arrives, split its prefill into chunks of 256 tokens. Measure the impact on: (a) decode latency for in-flight requests, (b) time-to-first-token for the new request, (c) overall throughput. Compare chunk sizes of 128, 256, 512, and 1024.
Exercise 33.20: Request Scheduling
Implement three request scheduling strategies for an inference server: (a) FIFO (first-in, first-out), (b) shortest-job-first (prioritize requests with shorter expected output), (c) priority-based (user-defined priority levels). Using the simulator from Exercise 33.18, compare average latency, P95 latency, and fairness metrics for each strategy under varying load conditions.
Section 5: Serving Frameworks and Production
Exercise 33.21: vLLM Benchmarking
Deploy a small model (e.g., Llama-3-8B or a smaller model) using vLLM. Benchmark it with: (a) varying batch sizes (1, 4, 8, 16, 32), (b) varying input lengths (128, 512, 2048, 8192 tokens), (c) varying output lengths (64, 256, 1024 tokens). Plot throughput (tokens/sec) and latency (TTFT, TPS) for each configuration. Identify the optimal operating point.
Exercise 33.22: TensorRT-LLM vs. vLLM Comparison
Deploy the same model on both vLLM and TensorRT-LLM (or simulate the comparison using published benchmarks). Compare: (a) throughput at various batch sizes, (b) TTFT at various input lengths, (c) GPU memory utilization. Under what conditions does each framework excel?
Exercise 33.23: Model Cascade Design
Implement a model cascade that uses a small model (7B) as the first tier and a large model (70B, simulated) as the second tier. The cascade should route to the large model when the small model's confidence (max output probability) is below a threshold. Experiment with thresholds from 0.5 to 0.95. Plot the accuracy-cost tradeoff curve.
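A minimal sketch of the routing rule, with hypothetical small_model/large_model callables standing in for the real models (each is assumed to return a predicted label and a softmax distribution):

```python
import torch

def cascade_predict(x, small_model, large_model, threshold: float = 0.8):
    """Answer with the small model unless its max softmax probability falls below the threshold."""
    label, probs = small_model(x)
    if probs.max().item() >= threshold:
        return label, "small"
    label, _ = large_model(x)
    return label, "large"

# toy stand-ins: sharper logits imitate a more confident (larger) model
def make_model(temperature: float):
    def model(x):
        probs = torch.softmax(torch.randn(10) / temperature, dim=0)
        return probs.argmax().item(), probs
    return model

small, large = make_model(2.0), make_model(0.5)
for thr in (0.5, 0.7, 0.9, 0.95):
    routed = [cascade_predict(None, small, large, thr)[1] for _ in range(1000)]
    print(f"threshold {thr:.2f}: {routed.count('large') / 10:.1f}% of queries escalated")
```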
Exercise 33.24: Auto-Scaling Simulator
Build an auto-scaling simulator for a GPU inference cluster. Model: (a) time-varying request rates (sinusoidal daily pattern + random spikes), (b) GPU instance spin-up time (2 minutes), (c) instance cost ($8/hour). Implement scaling policies: reactive (scale on P95 latency), predictive (scale on forecast traffic), and hybrid. Compare total cost and SLA compliance for each.
Exercise 33.25: Cost-Performance Analysis
Create a comprehensive cost analysis tool that, given a model and deployment configuration, calculates: (a) cost per 1M input tokens, (b) cost per 1M output tokens, (c) daily cost at a given request rate, (d) break-even point vs. API pricing. Run the analysis for three scenarios: low volume (10K requests/day), medium volume (100K requests/day), and high volume (1M requests/day). Compare self-hosting (A100, H100) vs. API providers.
Section 6: Advanced Optimization
Exercise 33.26: Flash Attention Implementation
Implement a simplified version of Flash Attention for a single attention head. Compare memory usage and execution time against the standard attention implementation ($\text{softmax}(QK^T / \sqrt{d_k}) V$) for sequence lengths 512, 1024, 2048, and 4096. Verify numerical equivalence by comparing outputs (tolerance: 1e-5).
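A minimal sketch of the blockwise online-softmax formulation for a single head in FP32 with no causal mask (the real kernel also tiles the queries and fuses everything into one GPU kernel; this version only avoids materializing the full N×N score matrix):

```python
import math
import torch

def flash_attention_forward(q, k, v, block_size=256):
    """Process keys/values in blocks, keeping a running max and softmax denominator per query row."""
    n, d = q.shape
    o = torch.zeros_like(q)
    row_max = torch.full((n,), float("-inf"))
    row_sum = torch.zeros(n)
    for start in range(0, n, block_size):
        k_blk, v_blk = k[start:start + block_size], v[start:start + block_size]
        s = q @ k_blk.T / math.sqrt(d)                       # (n, block) partial scores
        new_max = torch.maximum(row_max, s.max(dim=-1).values)
        p = torch.exp(s - new_max[:, None])
        correction = torch.exp(row_max - new_max)            # rescale previously accumulated results
        row_sum = row_sum * correction + p.sum(dim=-1)
        o = o * correction[:, None] + p @ v_blk
        row_max = new_max
    return o / row_sum[:, None]

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
reference = torch.softmax(q @ k.T / math.sqrt(d), dim=-1) @ v
print("max abs diff vs. standard attention:",
      (flash_attention_forward(q, k, v) - reference).abs().max().item())
```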
Exercise 33.27: Operator Fusion
Implement two fused operations: (a) fused linear + ReLU (combine the matrix multiply and activation into a single function), and (b) fused softmax + dropout. Compare the fused and unfused implementations in terms of memory usage (peak allocations) and execution time. Use torch.profiler to analyze kernel launch counts.
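One way to approach part (a), assuming torch.compile is available to perform the actual kernel fusion; in pure eager mode, writing both ops in one function removes Python overhead and one intermediate tensor but does not fuse GPU kernels:

```python
import torch

def linear_relu_unfused(x, w, b):
    y = torch.nn.functional.linear(x, w, b)     # materializes the pre-activation tensor
    return torch.relu(y)

@torch.compile  # the inductor backend can fuse the ReLU into the matmul epilogue on supported hardware
def linear_relu_fused(x, w, b):
    return torch.relu(torch.nn.functional.linear(x, w, b))

x, w, b = torch.randn(4096, 1024), torch.randn(1024, 1024), torch.randn(1024)
diff = (linear_relu_unfused(x, w, b) - linear_relu_fused(x, w, b)).abs().max()
print("max abs difference between fused and unfused outputs:", diff.item())
```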
Exercise 33.28: Memory-Efficient Inference Pipeline
Build an inference pipeline that implements all of the following simultaneously: INT8 weight quantization, KV cache quantization, sliding window attention, and memory-mapped model loading. Measure peak GPU memory usage and compare to a naive FP16 implementation for a small model. Calculate the theoretical maximum batch size for each approach on a 24GB GPU.
Exercise 33.29: End-to-End Optimization Benchmark
Create a benchmarking framework that evaluates inference performance across multiple optimization combinations. Test at least 4 configurations (e.g., FP16 baseline, INT8, INT4+vLLM, INT4+speculative), measuring: TTFT, TPS, throughput, GPU utilization, and memory usage. Present results as a radar chart showing each configuration's strengths.
Exercise 33.30: Production Monitoring Dashboard
Design and implement a monitoring system for a model serving deployment. Track: (a) TTFT and TPS distributions over time, (b) GPU utilization and memory, (c) request queue depth, (d) error rates and types, (e) cost per request. Generate an alert when any metric exceeds a configurable threshold. Output a summary report with recommendations for optimization.