Case Study 2: Optimizing Inference Latency for Production
Chapter 33: Inference Optimization and Model Serving
Overview
- Organization: QuickCode, a developer tools company providing an AI-powered inline code completion assistant integrated into popular IDEs (VS Code, JetBrains, Neovim).
- Challenge: Code completion requires extremely low latency. Completions must appear within 200ms of the user pausing their typing, or the experience feels sluggish and breaks the developer's flow state. The existing deployment, a 13B parameter model on A100s, achieved 350ms P50 latency, which users reported as "noticeably slow."
- Goal: Reduce P95 latency below 200ms while maintaining completion quality and handling 10,000 concurrent developer sessions at peak.
Problem Analysis
Inline code completion has the most stringent latency requirements of any LLM application:
| Metric | Target | Current | Gap |
|---|---|---|---|
| Time to First Token (P50) | < 100ms | 180ms | 80ms |
| Time to First Token (P95) | < 200ms | 420ms | 220ms |
| Tokens per Second | > 100 | 55 | 45 |
| Completion Quality (HumanEval pass@1) | > 45% | 48.2% | Met |
| Concurrent Sessions | 10,000 | 3,000 | 7,000 |
The existing architecture:
- Model: Custom 13B code model fine-tuned from Code Llama 13B.
- Hardware: 8x A100 40GB GPUs across 2 nodes.
- Framework: Hugging Face TGI with default settings.
- Input: 512-2048 tokens of code context.
- Output: 20-80 tokens of completion (short outputs).
Latency Breakdown
The team profiled the inference pipeline to identify bottlenecks:
| Component | Time (P50) | Time (P95) | % of Total (P50) |
|---|---|---|---|
| Network (client to server) | 15ms | 45ms | 8% |
| Request queuing | 5ms | 120ms | 3% (29% at P95) |
| Tokenization | 2ms | 3ms | 1% |
| Prefill (1024 tokens avg) | 45ms | 85ms | 25% |
| First decode step | 25ms | 35ms | 14% |
| KV cache overhead | 8ms | 20ms | 4% |
| Generation (20 tokens avg) | 80ms | 112ms | 44% |
Key findings:
1. Queuing was the P95 killer. At peak load, requests waited 120ms+ in the queue.
2. Prefill was expensive. Code context averaging 1024 tokens required significant compute.
3. Generation speed was inadequate. At 55 tokens/sec, 20 tokens took roughly 360ms.
4. Batch contention. Large batches improved throughput but degraded per-request latency.
Optimization Strategy
The team implemented optimizations in phases, measuring impact after each:
Phase 1: Model Optimization (Week 1-2)
1a. INT8 Quantization
Quantized the 13B model to INT8 using bitsandbytes:
| Metric | FP16 | INT8 | Change |
|---|---|---|---|
| Model Size | 26 GB | 13 GB | -50% |
| TTFT (P50) | 180ms | 125ms | -31% |
| TPS | 55 | 78 | +42% |
| HumanEval pass@1 | 48.2% | 47.8% | -0.4% |
INT8 delivered meaningful improvement but was insufficient to meet the 200ms P95 target.
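For orientation, a minimal sketch of INT8 loading through the Hugging Face transformers integration with bitsandbytes follows. The checkpoint name is a stand-in for the team's fine-tuned 13B model, and the case study does not specify their exact loading code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# LLM.int8() weight quantization via bitsandbytes; halves model memory
# relative to FP16, as in the table above.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-13b-hf",   # stand-in for the fine-tuned 13B checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-13b-hf")
```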
1b. Speculative Decoding
Trained a 1B parameter draft model on the same code corpus. The draft model shared the 13B model's vocabulary and was architecturally a scaled-down version (12 layers, 16 heads).
Draft model training:
- Architecture: 1B parameter transformer (same tokenizer as the 13B).
- Training data: 50B tokens of code (same distribution as the 13B model).
- Training time: 3 days on 8x A100.
- Draft tokens per step ($K$): 5.
Speculative decoding results:
| Metric | INT8 Only | INT8 + Speculative | Change |
|---|---|---|---|
| TTFT (P50) | 125ms | 125ms | 0% (prefill unchanged) |
| TPS | 78 | 165 | +112% |
| Acceptance Rate | N/A | 82% | - |
| HumanEval pass@1 | 47.8% | 47.8% | 0% (exact match guaranteed) |
Speculative decoding more than doubled the generation speed. With 82% acceptance rate, the draft model agreed with the target model on 4.1 out of 5 proposed tokens on average.
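To illustrate the mechanics outside the serving stack, the transformers library exposes the same draft-and-verify pattern as assisted generation. The sketch below uses the public Code Llama checkpoint and a hypothetical draft checkpoint name as stand-ins; the team's production implementation ran inside the serving engine rather than through generate().

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-13b-hf")
target = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-13b-hf", device_map="auto"
)
# Hypothetical 1B draft checkpoint; it must share the target's tokenizer.
draft = AutoModelForCausalLM.from_pretrained(
    "quickcode/code-draft-1b", device_map="auto"
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(target.device)

# assistant_model turns on assisted (speculative) generation: the draft proposes
# a block of tokens and the target verifies them in a single forward pass.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```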
Phase 2: Serving Optimization (Week 2-3)
2a. Migration to vLLM
Migrated from Hugging Face TGI to vLLM for PagedAttention and continuous batching:
| Metric | TGI | vLLM | Change |
|---|---|---|---|
| Max concurrent requests | 32 | 128 | 4x |
| TTFT (P95) | 380ms | 180ms | -53% |
| Throughput (tokens/sec total) | 4,400 | 11,200 | +155% |
| GPU memory utilization | 72% | 91% | +19 pts |
The PagedAttention-based memory management eliminated KV cache fragmentation, allowing four times as many concurrent requests. Continuous batching removed most of the head-of-line queuing delay that dominated P95 latency.
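As a point of reference, a minimal offline vLLM sketch is shown below. The checkpoint name and parallelism settings are assumptions; the production deployment would run behind vLLM's server frontend rather than the offline LLM class.

```python
from vllm import LLM, SamplingParams

# Offline illustration of the vLLM engine; PagedAttention and continuous
# batching are handled internally by the engine.
llm = LLM(
    model="codellama/CodeLlama-13b-hf",  # stand-in for the fine-tuned 13B
    tensor_parallel_size=2,              # assumed TP degree per node
    gpu_memory_utilization=0.90,
)

# Code completion: short outputs, low temperature.
params = SamplingParams(temperature=0.2, max_tokens=64)

outputs = llm.generate(["def quicksort(arr):"], params)
print(outputs[0].outputs[0].text)
```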
2b. Prefix Caching
Code completion requests shared a common prompt template (formatting instructions and language detection). At 200 tokens, this represented 10-20% of typical input. Prefix caching eliminated redundant prefill for this shared portion.
Impact: TTFT reduced by 15-20ms on average.
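In vLLM this corresponds to automatic prefix caching, which can be switched on with a single engine flag; a sketch under the same stand-in checkpoint assumption:

```python
from vllm import LLM

# Automatic prefix caching: KV blocks for the shared prompt template are
# reused across requests instead of being recomputed during prefill.
llm = LLM(
    model="codellama/CodeLlama-13b-hf",  # stand-in checkpoint
    enable_prefix_caching=True,
)
```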
2c. CUDA Graphs
Enabled CUDA graph capture for the decode phase using torch.compile with mode="reduce-overhead":
```python
# torch.compile reduces kernel launch overhead for the decode loop
model.decode = torch.compile(model.decode, mode="reduce-overhead")
```
This reduced per-step decode overhead by 30%, as the kernel launch overhead (which dominates at small batch sizes) was eliminated by replaying captured CUDA graphs.
Phase 3: Infrastructure Optimization (Week 3-4)
3a. Geographic Distribution
Deployed inference nodes in 3 regions (US-East, US-West, EU-West) to reduce network latency from 45ms (P95 cross-continent) to 15ms (P95 same-region).
3b. Request Prioritization
Implemented a priority scheduler that:
1. Gave highest priority to the most recently typed request from each user (older requests were likely stale).
2. Canceled in-flight requests when a new request arrived from the same user (the user had typed more, making the old completion irrelevant).
3. Shed load during extreme spikes by returning empty completions rather than queuing.
This reduced effective P95 TTFT by 40ms by eliminating stale request processing.
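A minimal asyncio sketch of the cancellation behavior is shown below. The names (handle_request, run_completion) are hypothetical placeholders for the service's request handler and its call into the inference backend; the real scheduler also implemented priorities and load shedding.

```python
import asyncio
from typing import Dict, Optional

# One in-flight completion per user session; a newer keystroke supersedes
# (and cancels) the stale request before it consumes more GPU time.
inflight: Dict[str, asyncio.Task] = {}

async def run_completion(user_id: str, context: str) -> str:
    """Placeholder for the actual call to the inference backend."""
    await asyncio.sleep(0.1)
    return f"completion for {user_id}"

async def handle_request(user_id: str, context: str) -> str:
    prev: Optional[asyncio.Task] = inflight.get(user_id)
    if prev is not None and not prev.done():
        prev.cancel()                      # the older request is stale
    task = asyncio.create_task(run_completion(user_id, context))
    inflight[user_id] = task
    try:
        return await task
    except asyncio.CancelledError:
        return ""                          # superseded; return an empty completion
```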
3c. Client-Side Optimization
- Speculative prefetch: The IDE plugin prefetched completions as the user typed, predicting likely pause points.
- Debouncing: Requests were only sent after a 50ms typing pause to avoid unnecessary calls during fast typing.
- Caching: Recent completions were cached client-side and reused if the user typed a prefix match.
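The debounce logic lives in the IDE plugin itself; the asyncio sketch below is a conceptual stand-in for that client-side code, with send_completion_request as a hypothetical network call.

```python
import asyncio
from typing import Optional

DEBOUNCE_SECONDS = 0.05  # 50ms typing pause before a completion request is sent

async def send_completion_request(context: str) -> None:
    """Hypothetical placeholder for the call to the completion service."""

class Debouncer:
    def __init__(self) -> None:
        self._pending: Optional[asyncio.Task] = None

    def on_keystroke(self, context: str) -> None:
        # Every keystroke resets the timer; only a 50ms pause fires a request.
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        self._pending = asyncio.create_task(self._fire(context))

    async def _fire(self, context: str) -> None:
        try:
            await asyncio.sleep(DEBOUNCE_SECONDS)
            await send_completion_request(context)
        except asyncio.CancelledError:
            pass  # superseded by further typing
```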
Final Results
End-to-End Performance
| Metric | Before | After | Target | Met? |
|---|---|---|---|---|
| TTFT (P50) | 180ms | 52ms | < 100ms | Yes |
| TTFT (P95) | 420ms | 135ms | < 200ms | Yes |
| TPS per request | 55 | 165 | > 100 | Yes |
| Concurrent Sessions | 3,000 | 12,000 | 10,000 | Yes |
| HumanEval pass@1 | 48.2% | 47.8% | > 45% | Yes |
| GPU count (total) | 8x A100 | 12x A100* | - | - |
| Cost per completion | $0.0008 | $0.0003 | < $0.0005 | Yes |
* 12 GPUs across 3 regions, but each GPU served 3x more requests.
Optimization Impact Breakdown
| Optimization | TTFT Impact | TPS Impact | Cost Impact |
|---|---|---|---|
| INT8 quantization | -31% | +42% | -30% |
| Speculative decoding | 0% | +112% | -20% |
| vLLM migration | -53% (P95) | +155% | -40% |
| Prefix caching | -12% | 0% | -5% |
| CUDA graphs | -8% | +15% | 0% |
| Geo-distribution | -65% (network) | 0% | +20% (more nodes) |
| Request prioritization | -22% (effective) | 0% | 0% |
| Combined | -71% (P50), -68% (P95) | +200% | -63% |
User Satisfaction
An internal A/B test with 500 developers over 4 weeks:
| Metric | Before | After |
|---|---|---|
| "Completions feel instant" | 23% | 78% |
| "Completions feel slow" | 52% | 8% |
| Completion acceptance rate | 28% | 41% |
| Developer NPS | +12 | +54 |
The 13-percentage-point increase in completion acceptance rate was the most impactful business metric: developers were accepting more completions because they arrived fast enough to be useful.
Technical Deep Dive: Speculative Decoding for Code
Why Code Is Ideal for Speculative Decoding
Code completion is particularly well-suited for speculative decoding because:
- High predictability: Much of code follows deterministic patterns (closing brackets, function signatures, common idioms). The draft model easily predicts these.
- Short outputs: Completions average 20-40 tokens. Even at 82% acceptance rate, the number of rejected tokens per completion is small.
- Shared vocabulary: Code tokens are a subset of the general vocabulary, making vocabulary agreement trivial.
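To make the draft-and-verify mechanics concrete, below is an illustrative greedy speculative step written against Hugging Face-style causal LMs. It omits KV caching and sampling-based acceptance, so it is a sketch of the idea rather than the production implementation.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids: torch.Tensor, k: int = 5) -> torch.Tensor:
    """One greedy draft-and-verify cycle (illustrative; no KV cache)."""
    # 1. Draft proposes k tokens autoregressively (cheap model).
    seq = input_ids
    for _ in range(k):
        next_id = draft(seq).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_id], dim=-1)
    proposed = seq[:, input_ids.shape[1]:]                              # [1, k]

    # 2. Target verifies all k proposals in a single forward pass.
    logits = target(seq).logits
    verify = logits[:, input_ids.shape[1] - 1 : -1, :].argmax(dim=-1)   # [1, k]

    # 3. Accept the longest prefix on which draft and target agree.
    matches = (proposed == verify)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum())
    accepted = proposed[:, :n_accept]

    # The target's own token at the first disagreement (or after all k) is free.
    bonus = logits[:, input_ids.shape[1] - 1 + n_accept, :].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, accepted, bonus], dim=-1)
```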
Acceptance Rate by Code Type
| Code Pattern | Acceptance Rate | Examples |
|---|---|---|
| Closing brackets/syntax | 98% | }, );, ] |
| Common patterns | 91% | for i in range(, if err != nil { |
| Variable names | 78% | Context-dependent identifiers |
| Complex logic | 65% | Novel algorithmic code |
| Natural language comments | 72% | Docstrings, comments |
| Overall | 82% | Weighted average |
The high acceptance rate for syntactic and common patterns meant the draft model handled the "easy" tokens while the target model's compute was reserved for the "hard" tokens where it added the most value.
Draft Model Architecture Decision
The team considered three draft model options:
| Option | Size | Acceptance Rate | Draft Speed | Net Speedup |
|---|---|---|---|---|
| Code Llama 1B | 1B | 78% | 800 tok/s | 1.9x |
| Custom 1B (trained) | 1B | 82% | 800 tok/s | 2.1x |
| Quantized 13B (INT4) | 13B INT4 | 90% | 200 tok/s | 1.4x |
The custom 1B model, trained on the same code corpus as the 13B, provided the best balance of speed and agreement. The quantized 13B had higher acceptance but was too slow to provide meaningful speedup.
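One first-order way to see the trade-off: under the standard prefix-acceptance analysis of speculative decoding, with per-token acceptance rate $\alpha$ and $K$ drafted tokens per cycle, the expected number of tokens committed per target forward pass is

$$\text{tokens per cycle} = \frac{1 - \alpha^{K+1}}{1 - \alpha},$$

which is roughly 3.9 for $\alpha = 0.82$ and $K = 5$. The net speedup divides this by the cycle time, which includes the $K$ draft steps plus one verification pass, so draft speed matters as much as acceptance rate: the 200 tok/s INT4 13B draft spends so long proposing tokens that its higher acceptance rate cannot compensate, while the 800 tok/s 1B draft keeps the drafting phase cheap. The measured speedups in the table also include verification and scheduling overheads, so treat this formula as an estimate.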
Lessons Learned
- Latency optimization is a stack, not a single technique. No single optimization was sufficient. The 71% TTFT reduction came from combining 7 optimizations that each contributed 8-53%.
- Measure the right thing. P50 latency was misleading: it was acceptable from the start. The user experience was dominated by P95 latency, which was 2x worse. Queuing was the primary P95 bottleneck, and it was invisible at P50.
- Speculative decoding is transformative for code. The 82% acceptance rate for code (higher than the 70-75% typical of natural language) made speculative decoding the single most impactful optimization for generation speed.
- Client-side optimizations compound with server-side. Debouncing, prefetching, and request cancellation reduced effective latency beyond what server-side optimizations alone achieved. The fastest request is the one you don't make.
- vLLM is a non-negotiable upgrade for production. The switch from TGI to vLLM delivered the largest single improvement. PagedAttention and continuous batching are foundational for any production LLM deployment.
- Geographic distribution is table stakes for latency-sensitive applications. The 30ms of network latency separating same-region from cross-continent requests represented 22% of the final P95 TTFT. No amount of model optimization can overcome speed-of-light constraints.
Key Takeaways
- Inline code completion requires sub-200ms P95 latency, demanding a stack of optimizations rather than any single technique.
- INT8 quantization + speculative decoding + vLLM forms a powerful optimization trinity, each addressing a different bottleneck.
- Code's high predictability makes it ideal for speculative decoding, with acceptance rates of 80%+ achievable.
- Request prioritization and cancellation are critical for interactive applications where stale requests waste resources.
- Client-side optimizations (debouncing, prefetching, caching) compound with server-side improvements.
- The total latency budget must account for network, queuing, prefill, and generation; optimizing only one component leaves others as bottlenecks.