Case Study 2: Optimizing Inference Latency for Production

Chapter 33: Inference Optimization and Model Serving


Overview

Organization: QuickCode, a developer tools company providing an AI-powered inline code completion assistant integrated into popular IDEs (VS Code, JetBrains, Neovim).

Challenge: Code completion requires extremely low latency. Completions must appear within 200ms of the user pausing their typing, or the experience feels sluggish and breaks the developer's flow state. The existing deployment, a 13B parameter model served on A100s, achieved 350ms P50 latency, which users reported as "noticeably slow."

Goal: Reduce P95 latency below 200ms while maintaining completion quality and handling 10,000 concurrent developer sessions at peak.


Problem Analysis

Inline code completion has the most stringent latency requirements of any LLM application:

| Metric | Target | Current | Gap |
|---|---|---|---|
| Time to First Token (P50) | < 100ms | 180ms | 80ms |
| Time to First Token (P95) | < 200ms | 420ms | 220ms |
| Tokens per Second | > 100 | 55 | 45 |
| Completion Quality (HumanEval pass@1) | > 45% | 48.2% | Met |
| Concurrent Sessions | 10,000 | 3,000 | 7,000 |

The existing architecture:

  • Model: Custom 13B code model fine-tuned from Code Llama 13B.
  • Hardware: 8x A100 40GB GPUs across 2 nodes.
  • Framework: Hugging Face TGI with default settings.
  • Input: 512-2048 tokens of code context.
  • Output: 20-80 tokens of completion (short outputs).

Latency Breakdown

The team profiled the inference pipeline to identify bottlenecks:

| Component | Time (P50) | Time (P95) | % of Total |
|---|---|---|---|
| Network (client to server) | 15ms | 45ms | 8% |
| Request queuing | 5ms | 120ms | 3% (P50) / 29% (P95) |
| Tokenization | 2ms | 3ms | 1% |
| Prefill (1024 tokens avg) | 45ms | 85ms | 25% |
| First decode step | 25ms | 35ms | 14% |
| KV cache overhead | 8ms | 20ms | 4% |
| Generation (20 tokens avg) | 80ms | 112ms | 44% |

Key findings:

  1. Queuing was the P95 killer. At peak load, requests waited 120ms+ in the queue.
  2. Prefill was expensive. Code context averaging 1024 tokens required significant compute.
  3. Generation speed was inadequate. At 55 tokens/sec, a 20-token completion took roughly 360ms.
  4. Batch contention. Large batches improved throughput but degraded per-request latency.
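For reference, below is a minimal sketch of the kind of client-side timing that produces a TTFT and tokens-per-second measurement, assuming a streaming HTTP completion endpoint (the URL and JSON schema are hypothetical); the team's actual profiling also instrumented each server-side stage.

```python
import time
import requests  # pip install requests

def profile_completion(prompt: str, url: str = "http://localhost:8000/generate"):
    """Measure TTFT and decode tokens/sec for one streaming completion request."""
    start = time.perf_counter()
    ttft, n_tokens = None, 0
    # stream=True lets us timestamp each generated chunk as it arrives
    with requests.post(url, json={"prompt": prompt, "max_tokens": 40}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            if ttft is None:
                ttft = time.perf_counter() - start  # first token observed
            n_tokens += 1  # assumes the server streams one token per line
    total = time.perf_counter() - start
    tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0
    return ttft, tps
```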


Optimization Strategy

The team implemented optimizations in phases, measuring impact after each:

Phase 1: Model Optimization (Week 1-2)

1a. INT8 Quantization

Quantized the 13B model to INT8 using bitsandbytes:

| Metric | FP16 | INT8 | Change |
|---|---|---|---|
| Model Size | 26 GB | 13 GB | -50% |
| TTFT (P50) | 180ms | 125ms | -31% |
| TPS | 55 | 78 | +42% |
| HumanEval pass@1 | 48.2% | 47.8% | -0.4 pts |

INT8 delivered meaningful improvement but was insufficient to meet the 200ms P95 target.
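A minimal sketch of an INT8 load through bitsandbytes and Hugging Face Transformers, roughly matching this step; the checkpoint name is a stand-in for QuickCode's fine-tuned 13B model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "codellama/CodeLlama-13b-hf"  # stand-in for the fine-tuned 13B checkpoint

# LLM.int8() quantization: weights are stored in 8-bit, with outlier
# activation channels handled in higher precision by bitsandbytes.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",           # shard across the available GPUs
    torch_dtype=torch.float16,   # non-quantized modules stay in FP16
)
```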

1b. Speculative Decoding

Trained a 1B parameter draft model on the same code corpus. The draft model shared the 13B model's vocabulary and was architecturally a scaled-down version (12 layers, 16 heads).

Draft model training:

  • Architecture: 1B parameter transformer (same tokenizer as the 13B).
  • Training data: 50B tokens of code (same distribution as the 13B model).
  • Training time: 3 days on 8x A100.
  • Draft tokens per step ($K$): 5.

Speculative decoding results:

| Metric | INT8 Only | INT8 + Speculative | Change |
|---|---|---|---|
| TTFT (P50) | 125ms | 125ms | 0% (prefill unchanged) |
| TPS | 78 | 165 | +112% |
| Acceptance Rate | N/A | 82% | - |
| HumanEval pass@1 | 47.8% | 47.8% | 0% (exact match guaranteed) |

Speculative decoding more than doubled the generation speed. With 82% acceptance rate, the draft model agreed with the target model on 4.1 out of 5 proposed tokens on average.
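The production stack used a custom draft-and-verify loop inside the serving engine; the sketch below shows the same mechanism via Hugging Face Transformers' assisted generation. The model names are stand-ins, and the 1B draft checkpoint is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "codellama/CodeLlama-13b-hf"   # stand-in for the 13B target model
draft_id = "quickcode/code-draft-1b"       # hypothetical 1B draft checkpoint

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("def binary_search(arr, target):", return_tensors="pt").to(target.device)

# The draft model proposes a block of tokens; the target model verifies them in a
# single forward pass and keeps the longest accepted prefix, so the output is
# token-for-token identical to decoding with the target model alone.
output = target.generate(
    **inputs,
    assistant_model=draft,   # enables assisted (speculative) decoding
    max_new_tokens=40,
    do_sample=False,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```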

Phase 2: Serving Optimization (Week 2-3)

2a. Migration to vLLM

Migrated from Hugging Face TGI to vLLM for PagedAttention and continuous batching:

| Metric | TGI | vLLM | Change |
|---|---|---|---|
| Max concurrent requests | 32 | 128 | 4x |
| TTFT (P95) | 380ms | 180ms | -53% |
| Throughput (total tokens/sec) | 4,400 | 11,200 | +155% |
| GPU memory utilization | 72% | 91% | +19 pts |

The PagedAttention-based memory management eliminated fragmentation, allowing 4x more concurrent requests. Continuous batching eliminated queuing delays.
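In production the team ran vLLM's serving frontend; the offline Python API below is a minimal sketch of the same engine, with a stand-in model name and illustrative parameter values.

```python
from vllm import LLM, SamplingParams

# vLLM provides PagedAttention and continuous batching out of the box.
llm = LLM(
    model="codellama/CodeLlama-13b-hf",  # stand-in for the fine-tuned 13B model
    tensor_parallel_size=2,              # split the model across 2 GPUs
    gpu_memory_utilization=0.90,         # reserve most of VRAM for weights + KV cache
)

# Short, deterministic completions suit inline code completion.
params = SamplingParams(temperature=0.0, max_tokens=64, stop=["\n\n"])

for output in llm.generate(["def quicksort(arr):"], params):
    print(output.outputs[0].text)
```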

2b. Prefix Caching

Code completion requests shared a common prompt template (formatting instructions and language detection). At 200 tokens, this represented 10-20% of typical input. Prefix caching eliminated redundant prefill for this shared portion.

Impact: TTFT reduced by 15-20ms on average.
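In vLLM this is a single engine flag, sketched below with the same stand-in model name; KV-cache blocks for the shared template prefix are then reused across requests instead of being recomputed.

```python
from vllm import LLM

# Automatic prefix caching: KV-cache blocks for the shared prompt template are
# reused, so only the user-specific code context is prefilled per request.
llm = LLM(
    model="codellama/CodeLlama-13b-hf",  # stand-in for the fine-tuned 13B model
    enable_prefix_caching=True,
)
```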

2c. CUDA Graphs

Enabled CUDA graph capture for the decode phase using torch.compile with mode="reduce-overhead":

# mode="reduce-overhead" has torch.compile capture the decode step as a CUDA
# graph; replaying the graph skips per-step kernel launch overhead
model.decode = torch.compile(model.decode, mode="reduce-overhead")

This reduced per-step decode overhead by 30%, as the kernel launch overhead (which dominates at small batch sizes) was eliminated by replaying captured CUDA graphs.

Phase 3: Infrastructure Optimization (Week 3-4)

3a. Geographic Distribution

Deployed inference nodes in 3 regions (US-East, US-West, EU-West) to reduce network latency from 45ms (P95 cross-continent) to 15ms (P95 same-region).

3b. Request Prioritization

Implemented a priority scheduler that:

  1. Gave highest priority to the most recently typed request from each user (older requests were likely stale).
  2. Canceled in-flight requests when a new request arrived from the same user (the user typed more, making the old completion irrelevant).
  3. Shed load during extreme spikes by returning empty completions rather than queuing.

This reduced effective P95 TTFT by 40ms by eliminating stale request processing.
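A simplified asyncio sketch of the cancellation rule (keep only the newest in-flight request per user); all names here are hypothetical, and the real scheduler also handled priorities and load shedding.

```python
import asyncio

# Hypothetical sketch: track the newest in-flight completion per user and
# cancel the previous one as soon as a fresher request arrives.
in_flight: dict[str, asyncio.Task] = {}

async def run_model(user_id: str, context: str) -> str:
    await asyncio.sleep(0.15)           # placeholder for the real inference call
    return f"<completion for {user_id}>"

async def handle_request(user_id: str, context: str) -> str | None:
    previous = in_flight.get(user_id)
    if previous is not None and not previous.done():
        previous.cancel()               # the user typed more; the old request is stale
    task = asyncio.create_task(run_model(user_id, context))
    in_flight[user_id] = task
    try:
        return await task
    except asyncio.CancelledError:
        return None                     # superseded by a newer request from this user
```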

3c. Client-Side Optimization

  • Speculative prefetch: The IDE plugin prefetched completions as the user typed, predicting likely pause points.
  • Debouncing: Requests were only sent after a 50ms typing pause to avoid unnecessary calls during fast typing.
  • Caching: Recent completions were cached client-side and reused if the user typed a prefix match.
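The real logic lives in each IDE plugin's own language; the Python asyncio sketch below only illustrates how the 50ms debounce and the prefix-match cache interact (all names are hypothetical).

```python
import asyncio

DEBOUNCE_S = 0.05   # 50ms: only send a request once the user pauses typing

class CompletionClient:
    def __init__(self, request_fn):
        self._request_fn = request_fn           # async callable: context -> completion text
        self._pending: asyncio.Task | None = None
        self._cache: dict[str, str] = {}        # recent context -> completion

    async def on_keystroke(self, context: str) -> str | None:
        # Cache reuse: if the user typed a prefix of a cached completion, trim and reuse it.
        for ctx, completion in self._cache.items():
            full = ctx + completion
            if full.startswith(context) and len(context) >= len(ctx):
                return full[len(context):]
        # Debounce: cancel the timer started by the previous keystroke and restart it.
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        self._pending = asyncio.create_task(self._debounced_request(context))
        try:
            return await self._pending
        except asyncio.CancelledError:
            return None                         # superseded by a newer keystroke

    async def _debounced_request(self, context: str) -> str:
        await asyncio.sleep(DEBOUNCE_S)         # cancelled if another keystroke arrives first
        completion = await self._request_fn(context)
        self._cache[context] = completion
        return completion
```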

Final Results

End-to-End Performance

| Metric | Before | After | Target | Met? |
|---|---|---|---|---|
| TTFT (P50) | 180ms | 52ms | < 100ms | Yes |
| TTFT (P95) | 420ms | 135ms | < 200ms | Yes |
| TPS per request | 55 | 165 | > 100 | Yes |
| Concurrent Sessions | 3,000 | 12,000 | 10,000 | Yes |
| HumanEval pass@1 | 48.2% | 47.8% | > 45% | Yes |
| GPU count (total) | 8x A100 | 12x A100* | - | - |
| Cost per completion | $0.0008 | $0.0003 | < $0.0005 | Yes |

* 12 GPUs across 3 regions, but each GPU served 3x more requests.

Optimization Impact Breakdown

| Optimization | TTFT Impact | TPS Impact | Cost Impact |
|---|---|---|---|
| INT8 quantization | -31% | +42% | -30% |
| Speculative decoding | 0% | +112% | -20% |
| vLLM migration | -53% (P95) | +155% | -40% |
| Prefix caching | -12% | 0% | -5% |
| CUDA graphs | -8% | +15% | 0% |
| Geo-distribution | -65% (network) | 0% | +20% (more nodes) |
| Request prioritization | -22% (effective) | 0% | 0% |
| Combined | -71% (P50), -68% (P95) | +200% | -63% |

User Satisfaction

An internal A/B test with 500 developers over 4 weeks:

| Metric | Before | After |
|---|---|---|
| "Completions feel instant" | 23% | 78% |
| "Completions feel slow" | 52% | 8% |
| Completion acceptance rate | 28% | 41% |
| Developer NPS | +12 | +54 |

The 13-percentage-point increase in completion acceptance rate was the most impactful business metric: developers were accepting more completions because they arrived fast enough to be useful.


Technical Deep Dive: Speculative Decoding for Code

Why Code Is Ideal for Speculative Decoding

Code completion is particularly well-suited for speculative decoding because:

  1. High predictability: Much of code follows deterministic patterns (closing brackets, function signatures, common idioms). The draft model easily predicts these.
  2. Short outputs: Completions average 20-40 tokens. Even at 82% acceptance rate, the number of rejected tokens per completion is small.
  3. Shared vocabulary: Code tokens are a subset of the general vocabulary, making vocabulary agreement trivial.

Acceptance Rate by Code Type

| Code Pattern | Acceptance Rate | Examples |
|---|---|---|
| Closing brackets/syntax | 98% | `}`, `);`, `]` |
| Common patterns | 91% | `for i in range(`, `if err != nil {` |
| Variable names | 78% | Context-dependent identifiers |
| Complex logic | 65% | Novel algorithmic code |
| Natural language comments | 72% | Docstrings, comments |
| Overall | 82% | Weighted average |

The high acceptance rate for syntactic and common patterns meant the draft model handled the "easy" tokens while the target model's compute was reserved for the "hard" tokens where it added the most value.

Draft Model Architecture Decision

The team considered three draft model options:

| Option | Size | Acceptance Rate | Draft Speed | Net Speedup |
|---|---|---|---|---|
| Code Llama 1B | 1B | 78% | 800 tok/s | 1.9x |
| Custom 1B (trained) | 1B | 82% | 800 tok/s | 2.1x |
| Quantized 13B (INT4) | 13B INT4 | 90% | 200 tok/s | 1.4x |

The custom 1B model, trained on the same code corpus as the 13B, provided the best balance of speed and agreement. The quantized 13B had higher acceptance but was too slow to provide meaningful speedup.


Lessons Learned

  1. Latency optimization is a stack, not a single technique. No single optimization was sufficient. The 71% TTFT reduction came from combining 7 optimizations that each contributed 8-53%.

  2. Measure the right thing. P50 latency was misleading: it was acceptable from the start. The user experience was dominated by P95 latency, which was more than 2x worse than P50. Queuing was the primary P95 bottleneck, and it was invisible at P50.

  3. Speculative decoding is transformative for code. The 82% acceptance rate for code (higher than typical natural language at 70-75%) made speculative decoding the single most impactful optimization for generation speed.

  4. Client-side optimizations compound with server-side. Debouncing, prefetching, and request cancellation reduced effective latency beyond what server-side optimizations achieved. The fastest request is the one you don't make.

  5. vLLM is a non-negotiable upgrade for production. The switch from TGI to vLLM delivered the largest single improvement. PagedAttention and continuous batching are foundational for any production LLM deployment.

  6. Geographic distribution is table stakes for latency-sensitive applications. 30ms of network latency (the difference between cross-continent and same-region routing) represented 22% of the final P95 TTFT. No amount of model optimization can overcome speed-of-light constraints.


Key Takeaways

  • Inline code completion requires sub-200ms P95 latency, demanding a stack of optimizations rather than any single technique.
  • INT8 quantization + speculative decoding + vLLM forms a powerful optimization trinity, each addressing a different bottleneck.
  • Code's high predictability makes it ideal for speculative decoding, with acceptance rates of 80%+ achievable.
  • Request prioritization and cancellation are critical for interactive applications where stale requests waste resources.
  • Client-side optimizations (debouncing, prefetching, caching) compound with server-side improvements.
  • The total latency budget must account for network, queuing, prefill, and generation; optimizing only one component leaves others as bottlenecks.