Case Study 2: Optimizing Inference Latency for Production

Chapter 33: Inference Optimization and Model Serving


Overview

Organization: QuickCode, a developer tools company providing an AI-powered inline code completion assistant integrated into popular IDEs (VS Code, JetBrains, Neovim).

Challenge: Code completion requires extremely low latency. Completions must appear within 200ms of the user pausing their typing, or the experience feels sluggish and breaks the developer's flow state. The existing deployment, a 13B parameter model served on A100s, achieved 350ms P50 latency, which users reported as "noticeably slow."

Goal: Reduce P95 latency below 200ms while maintaining completion quality and handling 10,000 concurrent developer sessions at peak.


Problem Analysis

Inline code completion has the most stringent latency requirements of any LLM application:

| Metric | Target | Current | Gap |
|---|---|---|---|
| Time to First Token (P50) | < 100ms | 180ms | 80ms |
| Time to First Token (P95) | < 200ms | 420ms | 220ms |
| Tokens per Second | > 100 | 55 | 45 |
| Completion Quality (HumanEval pass@1) | > 45% | 48.2% | Met |
| Concurrent Sessions | 10,000 | 3,000 | 7,000 |

The existing architecture:

  • Model: Custom 13B code model fine-tuned from Code Llama 13B.
  • Hardware: 8x A100 40GB GPUs across 2 nodes.
  • Framework: Hugging Face TGI with default settings.
  • Input: 512-2048 tokens of code context.
  • Output: 20-80 tokens of completion (short outputs).

Latency Breakdown

The team profiled the inference pipeline to identify bottlenecks:

| Component | Time (P50) | Time (P95) | % of Total |
|---|---|---|---|
| Network (client to server) | 15ms | 45ms | 8% |
| Request queuing | 5ms | 120ms | 3% (P50) / 29% (P95) |
| Tokenization | 2ms | 3ms | 1% |
| Prefill (1024 tokens avg) | 45ms | 85ms | 25% |
| First decode step | 25ms | 35ms | 14% |
| KV cache overhead | 8ms | 20ms | 4% |
| Generation (20 tokens avg) | 80ms | 112ms | 44% |

Key findings:

  1. Queuing was the P95 killer. At peak load, requests waited 120ms+ in the queue.
  2. Prefill was expensive. Code context averaging 1024 tokens required significant compute.
  3. Generation speed was inadequate. At 55 tokens/sec, a 20-token completion took roughly 360ms.
  4. Batch contention. Large batches improved throughput but degraded per-request latency.
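For reference, below is a minimal sketch of the kind of client-side timing that produces a TTFT and tokens-per-second measurement, assuming a streaming HTTP completion endpoint (the URL and JSON schema are hypothetical); the team's actual profiling also instrumented each server-side stage.

```python
import time
import requests  # pip install requests

def profile_completion(prompt: str, url: str = "http://localhost:8000/generate"):
    """Measure TTFT and decode tokens/sec for one streaming completion request."""
    start = time.perf_counter()
    ttft, n_tokens = None, 0
    # stream=True lets us timestamp each generated chunk as it arrives
    with requests.post(url, json={"prompt": prompt, "max_tokens": 40}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            if ttft is None:
                ttft = time.perf_counter() - start  # first token observed
            n_tokens += 1  # assumes the server streams one token per line
    total = time.perf_counter() - start
    tps = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0
    return ttft, tps
```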


Optimization Strategy

The team implemented optimizations in phases, measuring impact after each:

Phase 1: Model Optimization (Week 1-2)

1a. INT8 Quantization

Quantized the 13B model to INT8 using bitsandbytes:

| Metric | FP16 | INT8 | Change |
|---|---|---|---|
| Model Size | 26 GB | 13 GB | -50% |
| TTFT (P50) | 180ms | 125ms | -31% |
| TPS | 55 | 78 | +42% |
| HumanEval pass@1 | 48.2% | 47.8% | -0.4 pts |

INT8 delivered meaningful improvement but was insufficient to meet the 200ms P95 target.
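A minimal sketch of an INT8 load through bitsandbytes and Hugging Face Transformers, roughly matching this step; the checkpoint name is a stand-in for QuickCode's fine-tuned 13B model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "codellama/CodeLlama-13b-hf"  # stand-in for the fine-tuned 13B checkpoint

# LLM.int8() quantization: weights are stored in 8-bit, with outlier
# activation channels handled in higher precision by bitsandbytes.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",           # shard across the available GPUs
    torch_dtype=torch.float16,   # non-quantized modules stay in FP16
)
```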

1b. Speculative Decoding

Trained a 1B parameter draft model on the same code corpus. The draft model shared the 13B model's vocabulary and was architecturally a scaled-down version (12 layers, 16 heads).

Draft model training:

  • Architecture: 1B parameter transformer (same tokenizer as the 13B).
  • Training data: 50B tokens of code (same distribution as the 13B model).
  • Training time: 3 days on 8x A100.
  • Draft tokens per step ($K$): 5.

Speculative decoding results:

| Metric | INT8 Only | INT8 + Speculative | Change |
|---|---|---|---|
| TTFT (P50) | 125ms | 125ms | 0% (prefill unchanged) |
| TPS | 78 | 165 | +112% |
| Acceptance Rate | N/A | 82% | - |
| HumanEval pass@1 | 47.8% | 47.8% | 0% (exact match guaranteed) |

Speculative decoding more than doubled the generation speed. With 82% acceptance rate, the draft model agreed with the target model on 4.1 out of 5 proposed tokens on average.
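The production stack used a custom draft-and-verify loop inside the serving engine; the sketch below shows the same mechanism via Hugging Face Transformers' assisted generation. The model names are stand-ins, and the 1B draft checkpoint is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "codellama/CodeLlama-13b-hf"   # stand-in for the 13B target model
draft_id = "quickcode/code-draft-1b"       # hypothetical 1B draft checkpoint

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("def binary_search(arr, target):", return_tensors="pt").to(target.device)

# The draft model proposes a block of tokens; the target model verifies them in a
# single forward pass and keeps the longest accepted prefix, so the output is
# token-for-token identical to decoding with the target model alone.
output = target.generate(
    **inputs,
    assistant_model=draft,   # enables assisted (speculative) decoding
    max_new_tokens=40,
    do_sample=False,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```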

Phase 2: Serving Optimization (Week 2-3)

2a. Migration to vLLM

Migrated from Hugging Face TGI to vLLM for PagedAttention and continuous batching:

| Metric | TGI | vLLM | Change |
|---|---|---|---|
| Max concurrent requests | 32 | 128 | 4x |
| TTFT (P95) | 380ms | 180ms | -53% |
| Throughput (total tokens/sec) | 4,400 | 11,200 | +155% |
| GPU memory utilization | 72% | 91% | +19 pts |

The PagedAttention-based memory management eliminated fragmentation, allowing 4x more concurrent requests. Continuous batching eliminated queuing delays.
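In production the team ran vLLM's serving frontend; the offline Python API below is a minimal sketch of the same engine, with a stand-in model name and illustrative parameter values.

```python
from vllm import LLM, SamplingParams

# vLLM provides PagedAttention and continuous batching out of the box.
llm = LLM(
    model="codellama/CodeLlama-13b-hf",  # stand-in for the fine-tuned 13B model
    tensor_parallel_size=2,              # split the model across 2 GPUs
    gpu_memory_utilization=0.90,         # reserve most of VRAM for weights + KV cache
)

# Short, deterministic completions suit inline code completion.
params = SamplingParams(temperature=0.0, max_tokens=64, stop=["\n\n"])

for output in llm.generate(["def quicksort(arr):"], params):
    print(output.outputs[0].text)
```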

2b. Prefix Caching

Code completion requests shared a common prompt template (formatting instructions and language detection). At 200 tokens, this represented 10-20% of typical input. Prefix caching eliminated redundant prefill for this shared portion.

Impact: TTFT reduced by 15-20ms on average.
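In vLLM this is a single engine flag, sketched below with the same stand-in model name; KV-cache blocks for the shared template prefix are then reused across requests instead of being recomputed.

```python
from vllm import LLM

# Automatic prefix caching: KV-cache blocks for the shared prompt template are
# reused, so only the user-specific code context is prefilled per request.
llm = LLM(
    model="codellama/CodeLlama-13b-hf",  # stand-in for the fine-tuned 13B model
    enable_prefix_caching=True,
)
```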

2c. CUDA Graphs

Enabled CUDA graph capture for the decode phase using torch.compile with mode="reduce-overhead":

# mode="reduce-overhead" has torch.compile capture the decode step as a CUDA
# graph; replaying the graph skips per-step kernel launch overhead
model.decode = torch.compile(model.decode, mode="reduce-overhead")

This reduced per-step decode overhead by 30%, as the kernel launch overhead (which dominates at small batch sizes) was eliminated by replaying captured CUDA graphs.

Phase 3: Infrastructure Optimization (Week 3-4)

3a. Geographic Distribution

Deployed inference nodes in 3 regions (US-East, US-West, EU-West) to reduce network latency from 45ms (P95 cross-continent) to 15ms (P95 same-region).

3b. Request Prioritization

Implemented a priority scheduler that:

  1. Gave highest priority to the most recently typed request from each user (older requests were likely stale).
  2. Canceled in-flight requests when a new request arrived from the same user (the user typed more, making the old completion irrelevant).
  3. Shed load during extreme spikes by returning empty completions rather than queuing.

This reduced effective P95 TTFT by 40ms by eliminating stale request processing.
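A simplified asyncio sketch of the cancellation rule (keep only the newest in-flight request per user); all names here are hypothetical, and the real scheduler also handled priorities and load shedding.

```python
import asyncio

# Hypothetical sketch: track the newest in-flight completion per user and
# cancel the previous one as soon as a fresher request arrives.
in_flight: dict[str, asyncio.Task] = {}

async def run_model(user_id: str, context: str) -> str:
    await asyncio.sleep(0.15)           # placeholder for the real inference call
    return f"<completion for {user_id}>"

async def handle_request(user_id: str, context: str) -> str | None:
    previous = in_flight.get(user_id)
    if previous is not None and not previous.done():
        previous.cancel()               # the user typed more; the old request is stale
    task = asyncio.create_task(run_model(user_id, context))
    in_flight[user_id] = task
    try:
        return await task
    except asyncio.CancelledError:
        return None                     # superseded by a newer request from this user
```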

3c. Client-Side Optimization

  • Speculative prefetch: The IDE plugin prefetched completions as the user typed, predicting likely pause points.
  • Debouncing: Requests were only sent after a 50ms typing pause to avoid unnecessary calls during fast typing.
  • Caching: Recent completions were cached client-side and reused if the user typed a prefix match.
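The real logic lives in each IDE plugin's own language; the Python asyncio sketch below only illustrates how the 50ms debounce and the prefix-match cache interact (all names are hypothetical).

```python
import asyncio

DEBOUNCE_S = 0.05   # 50ms: only send a request once the user pauses typing

class CompletionClient:
    def __init__(self, request_fn):
        self._request_fn = request_fn           # async callable: context -> completion text
        self._pending: asyncio.Task | None = None
        self._cache: dict[str, str] = {}        # recent context -> completion

    async def on_keystroke(self, context: str) -> str | None:
        # Cache reuse: if the user typed a prefix of a cached completion, trim and reuse it.
        for ctx, completion in self._cache.items():
            full = ctx + completion
            if full.startswith(context) and len(context) >= len(ctx):
                return full[len(context):]
        # Debounce: cancel the timer started by the previous keystroke and restart it.
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        self._pending = asyncio.create_task(self._debounced_request(context))
        try:
            return await self._pending
        except asyncio.CancelledError:
            return None                         # superseded by a newer keystroke

    async def _debounced_request(self, context: str) -> str:
        await asyncio.sleep(DEBOUNCE_S)         # cancelled if another keystroke arrives first
        completion = await self._request_fn(context)
        self._cache[context] = completion
        return completion
```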

Final Results

End-to-End Performance

| Metric | Before | After | Target | Met? |
|---|---|---|---|---|
| TTFT (P50) | 180ms | 52ms | < 100ms | Yes |
| TTFT (P95) | 420ms | 135ms | < 200ms | Yes |
| TPS per request | 55 | 165 | > 100 | Yes |
| Concurrent Sessions | 3,000 | 12,000 | 10,000 | Yes |
| HumanEval pass@1 | 48.2% | 47.8% | > 45% | Yes |
| GPU count (total) | 8x A100 | 12x A100* | - | - |
| Cost per completion | $0.0008 | $0.0003 | < $0.0005 | Yes |

* 12 GPUs across 3 regions, but each GPU served 3x more requests.

Optimization Impact Breakdown

| Optimization | TTFT Impact | TPS Impact | Cost Impact |
|---|---|---|---|
| INT8 quantization | -31% | +42% | -30% |
| Speculative decoding | 0% | +112% | -20% |
| vLLM migration | -53% (P95) | +155% | -40% |
| Prefix caching | -12% | 0% | -5% |
| CUDA graphs | -8% | +15% | 0% |
| Geo-distribution | -65% (network) | 0% | +20% (more nodes) |
| Request prioritization | -22% (effective) | 0% | 0% |
| Combined | -71% (P50), -68% (P95) | +200% | -63% |

User Satisfaction

An internal A/B test with 500 developers over 4 weeks:

| Metric | Before | After |
|---|---|---|
| "Completions feel instant" | 23% | 78% |
| "Completions feel slow" | 52% | 8% |
| Completion acceptance rate | 28% | 41% |
| Developer NPS | +12 | +54 |

The 13-percentage-point increase in completion acceptance rate was the most impactful business metric: developers were accepting more completions because they arrived fast enough to be useful.


Technical Deep Dive: Speculative Decoding for Code

Why Code Is Ideal for Speculative Decoding

Code completion is particularly well-suited for speculative decoding because:

  1. High predictability: Much of code follows deterministic patterns (closing brackets, function signatures, common idioms). The draft model easily predicts these.
  2. Short outputs: Completions average 20-40 tokens. Even at 82% acceptance rate, the number of rejected tokens per completion is small.
  3. Shared vocabulary: Code tokens are a subset of the general vocabulary, making vocabulary agreement trivial.

Acceptance Rate by Code Type

| Code Pattern | Acceptance Rate | Examples |
|---|---|---|
| Closing brackets/syntax | 98% | `}`, `);`, `]` |
| Common patterns | 91% | `for i in range(`, `if err != nil {` |
| Variable names | 78% | Context-dependent identifiers |
| Complex logic | 65% | Novel algorithmic code |
| Natural language comments | 72% | Docstrings, comments |
| Overall | 82% | Weighted average |

The high acceptance rate for syntactic and common patterns meant the draft model handled the "easy" tokens while the target model's compute was reserved for the "hard" tokens where it added the most value.

Draft Model Architecture Decision

The team considered three draft model options:

| Option | Size | Acceptance Rate | Draft Speed | Net Speedup |
|---|---|---|---|---|
| Code Llama 1B | 1B | 78% | 800 tok/s | 1.9x |
| Custom 1B (trained) | 1B | 82% | 800 tok/s | 2.1x |
| Quantized 13B (INT4) | 13B INT4 | 90% | 200 tok/s | 1.4x |

The custom 1B model, trained on the same code corpus as the 13B, provided the best balance of speed and agreement. The quantized 13B had higher acceptance but was too slow to provide meaningful speedup.


Lessons Learned

  1. Latency optimization is a stack, not a single technique. No single optimization was sufficient. The 71% TTFT reduction came from combining 7 optimizations that each contributed 8-53%.

  2. Measure the right thing. P50 latency was misleading: it was acceptable from the start. The user experience was dominated by P95 latency, which was more than 2x worse than P50. Queuing was the primary P95 bottleneck, and it was invisible at P50.

  3. Speculative decoding is transformative for code. The 82% acceptance rate for code (higher than typical natural language at 70-75%) made speculative decoding the single most impactful optimization for generation speed.

  4. Client-side optimizations compound with server-side. Debouncing, prefetching, and request cancellation reduced effective latency beyond what server-side optimizations achieved. The fastest request is the one you don't make.

  5. vLLM is a non-negotiable upgrade for production. The switch from TGI to vLLM delivered the largest single improvement. PagedAttention and continuous batching are foundational for any production LLM deployment.

  6. Geographic distribution is table stakes for latency-sensitive applications. 30ms of network latency (the difference between cross-continent and same-region routing) represented 22% of the final P95 TTFT. No amount of model optimization can overcome speed-of-light constraints.


Key Takeaways

  • Inline code completion requires sub-200ms P95 latency, demanding a stack of optimizations rather than any single technique.
  • INT8 quantization + speculative decoding + vLLM forms a powerful optimization trinity, each addressing a different bottleneck.
  • Code's high predictability makes it ideal for speculative decoding, with acceptance rates of 80%+ achievable.
  • Request prioritization and cancellation are critical for interactive applications where stale requests waste resources.
  • Client-side optimizations (debouncing, prefetching, caching) compound with server-side improvements.
  • The total latency budget must account for network, queuing, prefill, and generation; optimizing only one component leaves others as bottlenecks.