Case Study 1: Deploying a Quantized LLM with vLLM
Chapter 33: Inference Optimization and Model Serving
Overview
- Organization: FinAssist, a fintech startup providing AI-powered financial advisory chatbots to mid-market wealth management firms.
- Challenge: The company's MVP used the OpenAI API at a cost of $45,000/month for 3 million daily conversations (averaging 800 tokens per conversation). With projected growth to 15 million daily conversations, API costs would reach $225,000/month, threatening the company's unit economics.
- Goal: Self-host a competitive LLM with inference costs under $0.001 per conversation while maintaining response quality and achieving sub-500ms time-to-first-token.
Problem Analysis
FinAssist's financial advisory chatbot required:
- Quality: Responses needed to be accurate, compliant with financial regulations, and comparable to GPT-4-class output for the financial domain.
- Latency: Wealth advisors expected streaming responses beginning within 500ms. The chatbot was embedded in their workflow; slow responses disrupted their client meetings.
- Throughput: At peak hours (9 AM - 4 PM EST), traffic reached 500 requests per second.
- Cost: The current API spend of $0.015 per conversation was unsustainable at scale. The target was $0.001 or below.
- Privacy: Client financial data could not leave the company's infrastructure, and keeping inference in-house also supported the company's SOC 2 compliance commitments.
Benchmark Analysis
The team evaluated several models on their internal financial QA benchmark (2,000 questions with expert-rated answers):
| Model | Accuracy | Avg Latency | Cost/1M tokens | Notes |
|---|---|---|---|---|
| GPT-4 (API) | 92.1% | 1.2s TTFT | $30.00 | Current production |
| Llama-3-70B-Instruct (FP16) | 89.4% | 2.8s TTFT* | N/A | Requires 4x A100 80GB |
| Llama-3-70B-Instruct (INT4 GPTQ) | 88.1% | 0.4s TTFT* | N/A | Fits on 1x A100 80GB |
| Llama-3-70B-Instruct (INT4 AWQ) | 88.5% | 0.3s TTFT* | N/A | Fits on 1x A100 80GB |
| Llama-3-8B-Instruct (FP16) | 79.2% | 0.15s TTFT* | N/A | Single A100 with room |
* Estimated based on single-request benchmarks; production latency depends on batching.
The Llama-3-70B-Instruct with AWQ INT4 quantization offered the best quality-cost tradeoff: only 3.6 percentage points below GPT-4 on domain-specific tasks while fitting on a single A100 80GB GPU.
System Architecture
Quantization Pipeline
The team quantized Llama-3-70B-Instruct using AWQ with their financial domain calibration data:
Step 1: Calibration Data Preparation
- Selected 256 representative conversations from production logs.
- Ensured coverage of all financial advisory topics: portfolio allocation, tax planning, retirement, and risk assessment.
- Tokenized to match the model's expected input format.
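A minimal sketch of this preparation step, assuming conversations are exported as JSONL with OpenAI-style message lists; the file names are illustrative, and a production version would stratify the sample by advisory topic rather than sampling uniformly:

```python
# Illustrative calibration-data preparation: sample production conversations
# and render them with the model's chat template so calibration activations
# match the prompt format used in production. Paths and field names are assumptions.
import json
import random

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

with open("logs/production_conversations.jsonl") as f:
    conversations = [json.loads(line) for line in f]

# 256 samples; the real selection was stratified across portfolio allocation,
# tax planning, retirement, and risk assessment topics.
samples = random.sample(conversations, 256)

calib_texts = [
    tokenizer.apply_chat_template(conv["messages"], tokenize=False)
    for conv in samples
]

with open("calibration/financial_conversations.jsonl", "w") as f:
    for text in calib_texts:
        f.write(json.dumps({"text": text}) + "\n")
```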
Step 2: AWQ Quantization
Model: meta-llama/Llama-3-70B-Instruct
Method: AWQ (Activation-Aware Weight Quantization)
Bit Width: 4-bit (INT4)
Group Size: 128
Calibration Samples: 256 financial conversations
Quantization Time: 47 minutes on 1x A100 80GB
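A quantization script along these lines can be written with the open-source AutoAWQ library. This is a sketch, not the team's exact code: the quant settings mirror the configuration above, while the paths and the precise AutoAWQ call signatures are assumptions that vary slightly between releases.

```python
# Illustrative AWQ 4-bit quantization with AutoAWQ; quant_config mirrors the
# settings above (4-bit weights, group size 128). Paths are assumptions.
import json

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
quant_path = "/models/llama-3-70b-instruct-awq"
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

# Calibration texts prepared in Step 1 (one JSON object with a "text" field per line).
with open("calibration/financial_conversations.jsonl") as f:
    calib_texts = [json.loads(line)["text"] for line in f]

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```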
Step 3: Quality Validation
After quantization, the team re-ran their benchmark:
| Metric | FP16 | AWQ INT4 | Delta |
|---|---|---|---|
| Financial QA Accuracy | 89.4% | 88.5% | -0.9% |
| Compliance Score | 94.2% | 93.8% | -0.4% |
| Fluency Rating (1-5) | 4.6 | 4.5 | -0.1 |
| Perplexity (WikiText-2) | 5.21 | 5.38 | +0.17 |
| Model Size | 140 GB | 36 GB | -74% |
The 0.9% accuracy drop was within acceptable tolerance, and the compliance score remained above the 90% threshold.
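The re-run itself can be scripted against the quantized checkpoint with vLLM's offline API. A minimal sketch, assuming the internal benchmark is stored as JSONL with `id` and `prompt` fields (answers were still scored by the expert raters):

```python
# Illustrative re-evaluation run: generate deterministic answers from the
# AWQ checkpoint for the internal financial QA benchmark. File layout is assumed.
import json

from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3-70b-instruct-awq",
    quantization="awq",
    max_model_len=4096,
)
params = SamplingParams(temperature=0.0, max_tokens=512)

with open("benchmarks/financial_qa.jsonl") as f:
    questions = [json.loads(line) for line in f]

outputs = llm.generate([q["prompt"] for q in questions], params)

with open("benchmarks/awq_int4_responses.jsonl", "w") as f:
    for q, out in zip(questions, outputs):
        f.write(json.dumps({"id": q["id"], "response": out.outputs[0].text}) + "\n")
```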
Serving Architecture
┌─────────────┐
│ Load │
│ Balancer │
└──────┬──────┘
┌───────────┼───────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ vLLM │ │ vLLM │ │ vLLM │
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ (A100) │ │ (A100) │ │ (A100) │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└───────────┼───────────┘
│
┌──────┴──────┐
│ Monitoring │
│ (Prometheus │
│ + Grafana) │
└─────────────┘
Each vLLM node:
- Hosts the AWQ INT4 quantized Llama-3-70B-Instruct model.
- Uses PagedAttention for efficient KV cache management.
- Implements continuous batching with a maximum batch size of 64.
- Exposes an OpenAI-compatible API for drop-in replacement.
vLLM Configuration:
python -m vllm.entrypoints.openai.api_server \
--model /models/llama-3-70b-instruct-awq \
--quantization awq \
--gpu-memory-utilization 0.92 \
--max-model-len 4096 \
--max-num-seqs 64 \
--enable-prefix-caching \
--disable-log-requests
Prefix Caching for System Prompts
FinAssist's chatbot used a 1,200-token system prompt containing financial guidelines, compliance rules, and conversation format instructions. With prefix caching enabled in vLLM, this prompt was processed once and the KV cache was shared across all requests:
- Without prefix caching: 1,200 tokens prefilled per request = ~15ms overhead per request.
- With prefix caching: 1,200 tokens prefilled once, reused for all subsequent requests.
- At 500 requests/second, this saved approximately 7.5 seconds of GPU compute per second across the cluster.
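Because each node exposes an OpenAI-compatible API, clients simply send the identical system prompt on every request and the server reuses the cached KV blocks transparently. A client-side sketch, where the server URL, served model name, and placeholder system prompt are illustrative:

```python
# Illustrative client call against a vLLM node. Every request begins with the
# identical 1,200-token system prompt, so --enable-prefix-caching lets the
# server reuse that prompt's KV cache instead of recomputing it per request.
from openai import OpenAI

SYSTEM_PROMPT = "..."  # the shared financial-guidelines / compliance prompt

client = OpenAI(base_url="http://vllm-node-1:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="/models/llama-3-70b-instruct-awq",  # served model name defaults to the model path
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How should I rebalance a 60/40 portfolio this quarter?"},
    ],
    stream=True,
    max_tokens=512,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```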
Results
Performance Benchmarks
After deployment, the team conducted a comprehensive benchmark:
| Metric | Target | Achieved | Notes |
|---|---|---|---|
| TTFT (P50) | < 500ms | 180ms | Well under target |
| TTFT (P95) | < 800ms | 420ms | Acceptable |
| Output tokens/sec per request | > 30 | 45 | Streaming felt smooth |
| Throughput per node | > 150 req/s | 210 req/s | Above expectation |
| GPU memory usage | < 95% | 88% | Headroom for batch spikes |
| Quality (Financial QA) | > 85% | 88.5% | Acceptable |
Cost Analysis
| Cost Component | Monthly (3M conversations) | Monthly (15M conversations) |
|---|---|---|
| API (GPT-4) | $45,000 | $225,000 |
| Self-hosted (3x A100) | $10,800 | $10,800 |
| Infrastructure overhead | $2,400 | $3,600 |
| Engineering (amortized) | $5,000 | $2,000 |
| **Total self-hosted** | **$18,200** | **$16,400** |
| **Savings** | **$26,800 (60%)** | **$208,600 (93%)** |
At 3M daily conversations, self-hosting saved 60%. At 15M daily conversations, the fixed GPU cost was amortized across more requests, achieving 93% cost reduction.
Cost per conversation:
- API: $0.015
- Self-hosted: $0.0006 at 3M/day, $0.00011 at 15M/day
Quality Comparison
An A/B test over 2 weeks with 100,000 conversations compared the self-hosted model against the GPT-4 API:
| Metric | GPT-4 API | Self-Hosted AWQ | Delta |
|---|---|---|---|
| User satisfaction (1-5) | 4.32 | 4.21 | -0.11 |
| Advisor helpfulness rating | 4.18 | 4.09 | -0.09 |
| Compliance violations flagged | 0.3% | 0.4% | +0.1% |
| Conversation length (turns) | 5.8 | 6.1 | +0.3 |
The small quality gap was deemed acceptable given the 93% cost reduction. The slightly longer conversations suggested the self-hosted model occasionally required additional clarification turns, which the team addressed through prompt optimization.
Technical Challenges and Solutions
Challenge 1: Long-Tail Latency Spikes
Problem: While median TTFT was 180ms, occasional spikes reached 2-3 seconds, coinciding with bursts of requests with long prompts.
Solution: Implemented chunked prefill with a chunk size of 512 tokens. Long prompts were split into chunks interleaved with decode steps for existing requests. This capped TTFT spikes at 650ms (P99).
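In vLLM this corresponds to enabling chunked prefill and capping the number of tokens scheduled per step (the server flags are --enable-chunked-prefill and --max-num-batched-tokens; defaults vary by vLLM version). A sketch using the equivalent offline engine arguments:

```python
# Illustrative engine configuration with chunked prefill: long prompts are
# prefilled in <=512-token chunks that the scheduler interleaves with decode
# steps for already-running requests, bounding time-to-first-token spikes.
from vllm import LLM

llm = LLM(
    model="/models/llama-3-70b-instruct-awq",
    quantization="awq",
    max_model_len=4096,
    max_num_seqs=64,
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,  # per-step token budget = prefill chunk size
)
```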
Challenge 2: Memory Pressure Under High Load
Problem: At peak batch sizes (60+ concurrent requests), GPU memory occasionally hit 95%+, causing OOM errors.
Solution: Two-pronged approach:
1. Reduced max-num-seqs from 64 to 48 during peak hours (dynamic configuration).
2. Enabled KV cache INT8 quantization (halving KV cache memory usage).
Combined effect: Peak memory usage dropped to 82%, with no measurable quality impact.
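A back-of-the-envelope sketch shows why the KV cache dominates memory at this batch size, using Llama-3-70B's published shape (80 layers, 8 grouped-query KV heads, head dimension 128); actual usage is lower because PagedAttention only allocates pages for tokens actually generated:

```python
# Rough KV-cache sizing for Llama-3-70B under the deployment's settings.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
MAX_LEN, MAX_SEQS = 4096, 48

def kv_bytes_per_token(bytes_per_elem: float) -> float:
    # Keys + values, for every layer and KV head.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem

for name, width in [("FP16", 2), ("8-bit", 1)]:
    per_token = kv_bytes_per_token(width)
    worst_case = per_token * MAX_LEN * MAX_SEQS
    print(f"{name}: {per_token / 1024:.0f} KiB/token, "
          f"{worst_case / 2**30:.0f} GiB worst case for {MAX_SEQS} full-length sequences")

# FP16:  320 KiB/token, ~60 GiB worst case -> tight next to ~36 GB of weights on an 80 GB GPU
# 8-bit: 160 KiB/token, ~30 GiB worst case -> the "halving" cited above
```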
Challenge 3: Model Updates
Problem: Updating the model (e.g., applying new fine-tuning for regulatory changes) required taking nodes offline for 15 minutes to load the new model.
Solution: Implemented blue-green deployment:
1. Load the new model on a standby node while the old model continues serving.
2. Route traffic to the new node once it's ready.
3. Drain and update the old node.
This reduced downtime to near-zero with a brief period of reduced capacity.
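A sketch of the cut-over logic: the /health endpoint is the vLLM OpenAI-compatible server's readiness probe, while the switch_traffic and drain_node callbacks stand in for whatever load-balancer API is actually in use (hypothetical here).

```python
# Illustrative blue-green cut-over: wait until the standby node (loaded with the
# new model) reports healthy, then flip traffic to it and drain the old node.
import time

import requests

def wait_until_healthy(node_url: str, timeout_s: int = 1200) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{node_url}/health", timeout=2).status_code == 200:
                return
        except requests.RequestException:
            pass
        time.sleep(5)
    raise TimeoutError(f"{node_url} did not become healthy in time")

def deploy(new_node: str, old_node: str, switch_traffic, drain_node) -> None:
    # switch_traffic and drain_node are load-balancer-specific callbacks (hypothetical).
    wait_until_healthy(new_node)
    switch_traffic(new_node)  # point the load balancer at the standby node
    drain_node(old_node)      # stop new requests, let in-flight ones finish, then update
```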
Lessons Learned
- AWQ outperformed GPTQ for this use case. AWQ's quantization is less sensitive to the calibration distribution, so it generalized well to financial conversations, while GPTQ's calibration-data-dependent optimization showed slightly lower quality on out-of-distribution queries.
- Prefix caching was the highest-ROI optimization. Given that all conversations shared a 1,200-token system prompt, prefix caching effectively reduced per-request compute by 40%. This was more impactful than any other single optimization.
- INT4 quantization was the enabler, not the optimizer. The primary value of INT4 was reducing the model from 4x A100s to 1x A100 (a 4x hardware cost reduction). The throughput improvement from lower memory-bandwidth demand was a secondary benefit.
- Monitor P95 and P99 latency, not just P50. Median latency was excellent from day one, but tail latency issues only appeared under high load. The chunked prefill fix was not identified until P99 monitoring was added.
- The build vs. buy break-even shifts with volume. At 3M conversations/day, the break-even was borderline (60% savings). At 15M conversations/day, self-hosting was overwhelmingly cheaper. The team recommends starting with an API at low volume and migrating as volume grows.
Key Takeaways
- AWQ INT4 quantization enables serving 70B-parameter models on single GPUs with minimal quality loss.
- vLLM with PagedAttention and continuous batching provides production-grade serving with 2-4x throughput improvement over naive implementations.
- Prefix caching is extremely effective when requests share long system prompts.
- Self-hosting becomes economically dominant at high request volumes, with 90%+ cost savings achievable.
- Production deployment requires attention to tail latency, memory management under load, and zero-downtime update mechanisms.