Case Study 1: Deploying a Quantized LLM with vLLM
Chapter 33: Inference Optimization and Model Serving
Overview
- Organization: FinAssist, a fintech startup providing AI-powered financial advisory chatbots to mid-market wealth management firms.
- Challenge: The company's MVP used the OpenAI API at a cost of $45,000/month for 3 million daily conversations (averaging 800 tokens per conversation). With projected growth to 15 million daily conversations, API costs would reach $225,000/month, threatening the company's unit economics.
- Goal: Self-host a competitive LLM with inference costs under $0.001 per conversation while maintaining response quality and achieving sub-500ms time-to-first-token.
Problem Analysis
FinAssist's financial advisory chatbot required:
- Quality: Responses needed to be accurate, compliant with financial regulations, and comparable to GPT-4-class output for the financial domain.
- Latency: Wealth advisors expected streaming responses beginning within 500ms. The chatbot was embedded in their workflow; slow responses disrupted their client meetings.
- Throughput: At peak hours (9 AM - 4 PM EST), traffic reached 500 requests per second.
- Cost: The current API spend of $0.015 per conversation was unsustainable at scale. The target was $0.001 or below.
- Privacy: Client financial data could not leave the company's infrastructure, and keeping inference in-house also supported the company's SOC 2 compliance commitments.
Benchmark Analysis
The team evaluated several models on their internal financial QA benchmark (2,000 questions with expert-rated answers):
| Model | Accuracy | Avg Latency | Cost/1M tokens | Notes |
|---|---|---|---|---|
| GPT-4 (API) | 92.1% | 1.2s TTFT | $30.00 | Current production |
| Llama-3-70B-Instruct (FP16) | 89.4% | 2.8s TTFT* | N/A | Requires 4x A100 80GB |
| Llama-3-70B-Instruct (INT4 GPTQ) | 88.1% | 0.4s TTFT* | N/A | Fits on 1x A100 80GB |
| Llama-3-70B-Instruct (INT4 AWQ) | 88.5% | 0.3s TTFT* | N/A | Fits on 1x A100 80GB |
| Llama-3-8B-Instruct (FP16) | 79.2% | 0.15s TTFT* | N/A | Single A100 with room |
* Estimated based on single-request benchmarks; production latency depends on batching.
The Llama-3-70B-Instruct with AWQ INT4 quantization offered the best quality-cost tradeoff: only 3.6 percentage points below GPT-4 on domain-specific tasks while fitting on a single A100 80GB GPU.
System Architecture
Quantization Pipeline
The team quantized Llama-3-70B-Instruct using AWQ with their financial domain calibration data:
Step 1: Calibration Data Preparation
- Selected 256 representative conversations from production logs.
- Ensured coverage of all financial advisory topics: portfolio allocation, tax planning, retirement, and risk assessment.
- Tokenized to match the model's expected input format.
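A minimal sketch of this preparation step, assuming conversations are exported as JSONL with OpenAI-style message lists; the file names are illustrative, and a production version would stratify the sample by advisory topic rather than sampling uniformly:

```python
# Illustrative calibration-data preparation: sample production conversations
# and render them with the model's chat template so calibration activations
# match the prompt format used in production. Paths and field names are assumptions.
import json
import random

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

with open("logs/production_conversations.jsonl") as f:
    conversations = [json.loads(line) for line in f]

# 256 samples; the real selection was stratified across portfolio allocation,
# tax planning, retirement, and risk assessment topics.
samples = random.sample(conversations, 256)

calib_texts = [
    tokenizer.apply_chat_template(conv["messages"], tokenize=False)
    for conv in samples
]

with open("calibration/financial_conversations.jsonl", "w") as f:
    for text in calib_texts:
        f.write(json.dumps({"text": text}) + "\n")
```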
Step 2: AWQ Quantization
Model: meta-llama/Llama-3-70B-Instruct
Method: AWQ (Activation-Aware Weight Quantization)
Bit Width: 4-bit (INT4)
Group Size: 128
Calibration Samples: 256 financial conversations
Quantization Time: 47 minutes on 1x A100 80GB
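A quantization script along these lines can be written with the open-source AutoAWQ library. This is a sketch, not the team's exact code: the quant settings mirror the configuration above, while the paths and the precise AutoAWQ call signatures are assumptions that vary slightly between releases.

```python
# Illustrative AWQ 4-bit quantization with AutoAWQ; quant_config mirrors the
# settings above (4-bit weights, group size 128). Paths are assumptions.
import json

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
quant_path = "/models/llama-3-70b-instruct-awq"
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

# Calibration texts prepared in Step 1 (one JSON object with a "text" field per line).
with open("calibration/financial_conversations.jsonl") as f:
    calib_texts = [json.loads(line)["text"] for line in f]

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```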
Step 3: Quality Validation
After quantization, the team re-ran their benchmark:
| Metric | FP16 | AWQ INT4 | Delta |
|---|---|---|---|
| Financial QA Accuracy | 89.4% | 88.5% | -0.9% |
| Compliance Score | 94.2% | 93.8% | -0.4% |
| Fluency Rating (1-5) | 4.6 | 4.5 | -0.1 |
| Perplexity (WikiText-2) | 5.21 | 5.38 | +0.17 |
| Model Size | 140 GB | 36 GB | -74% |
The 0.9% accuracy drop was within acceptable tolerance, and the compliance score remained above the 90% threshold.
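The re-run itself can be scripted against the quantized checkpoint with vLLM's offline API. A minimal sketch, assuming the internal benchmark is stored as JSONL with `id` and `prompt` fields (answers were still scored by the expert raters):

```python
# Illustrative re-evaluation run: generate deterministic answers from the
# AWQ checkpoint for the internal financial QA benchmark. File layout is assumed.
import json

from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3-70b-instruct-awq",
    quantization="awq",
    max_model_len=4096,
)
params = SamplingParams(temperature=0.0, max_tokens=512)

with open("benchmarks/financial_qa.jsonl") as f:
    questions = [json.loads(line) for line in f]

outputs = llm.generate([q["prompt"] for q in questions], params)

with open("benchmarks/awq_int4_responses.jsonl", "w") as f:
    for q, out in zip(questions, outputs):
        f.write(json.dumps({"id": q["id"], "response": out.outputs[0].text}) + "\n")
```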
Serving Architecture
┌─────────────┐
│ Load │
│ Balancer │
└──────┬──────┘
┌───────────┼───────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ vLLM │ │ vLLM │ │ vLLM │
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ (A100) │ │ (A100) │ │ (A100) │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└───────────┼───────────┘
│
┌──────┴──────┐
│ Monitoring │
│ (Prometheus │
│ + Grafana) │
└─────────────┘
Each vLLM node:
- Hosts the AWQ INT4 quantized Llama-3-70B-Instruct model.
- Uses PagedAttention for efficient KV cache management.
- Implements continuous batching with a maximum batch size of 64.
- Exposes an OpenAI-compatible API for drop-in replacement.
vLLM Configuration:
python -m vllm.entrypoints.openai.api_server \
--model /models/llama-3-70b-instruct-awq \
--quantization awq \
--gpu-memory-utilization 0.92 \
--max-model-len 4096 \
--max-num-seqs 64 \
--enable-prefix-caching \
--disable-log-requests
Prefix Caching for System Prompts
FinAssist's chatbot used a 1,200-token system prompt containing financial guidelines, compliance rules, and conversation format instructions. With prefix caching enabled in vLLM, this prompt was processed once and the KV cache was shared across all requests:
- Without prefix caching: 1,200 tokens prefilled per request = ~15ms overhead per request.
- With prefix caching: 1,200 tokens prefilled once, reused for all subsequent requests.
- At 500 requests/second, this saved approximately 7.5 seconds of GPU compute per second across the cluster.
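Because each node exposes an OpenAI-compatible API, clients simply send the identical system prompt on every request and the server reuses the cached KV blocks transparently. A client-side sketch, where the server URL, served model name, and placeholder system prompt are illustrative:

```python
# Illustrative client call against a vLLM node. Every request begins with the
# identical 1,200-token system prompt, so --enable-prefix-caching lets the
# server reuse that prompt's KV cache instead of recomputing it per request.
from openai import OpenAI

SYSTEM_PROMPT = "..."  # the shared financial-guidelines / compliance prompt

client = OpenAI(base_url="http://vllm-node-1:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="/models/llama-3-70b-instruct-awq",  # served model name defaults to the model path
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How should I rebalance a 60/40 portfolio this quarter?"},
    ],
    stream=True,
    max_tokens=512,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```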
Results
Performance Benchmarks
After deployment, the team conducted a comprehensive benchmark:
| Metric | Target | Achieved | Notes |
|---|---|---|---|
| TTFT (P50) | < 500ms | 180ms | Well under target |
| TTFT (P95) | < 800ms | 420ms | Acceptable |
| Output tokens/sec per request | > 30 | 45 | Streaming felt smooth |
| Throughput per node | > 150 req/s | 210 req/s | Above expectation |
| GPU memory usage | < 95% | 88% | Headroom for batch spikes |
| Quality (Financial QA) | > 85% | 88.5% | Acceptable |
Cost Analysis
| Cost Component | Monthly (3M conversations) | Monthly (15M conversations) |
|---|---|---|
| API (GPT-4) | $45,000 | $225,000 |
| Self-hosted (3x A100) | $10,800 | $10,800 |
| Infrastructure overhead | $2,400 | $3,600 |
| Engineering (amortized) | $5,000 | $2,000 |
| **Total self-hosted** | **$18,200** | **$16,400** |
| **Savings** | **$26,800 (60%)** | **$208,600 (93%)** |
At 3M daily conversations, self-hosting saved 60%. At 15M daily conversations, the fixed GPU cost was amortized across more requests, achieving 93% cost reduction.
Cost per conversation:
- API: $0.015
- Self-hosted: $0.0006 at 3M/day, $0.00011 at 15M/day
Quality Comparison
An A/B test over 2 weeks with 100,000 conversations compared the self-hosted model against the GPT-4 API:
| Metric | GPT-4 API | Self-Hosted AWQ | Delta |
|---|---|---|---|
| User satisfaction (1-5) | 4.32 | 4.21 | -0.11 |
| Advisor helpfulness rating | 4.18 | 4.09 | -0.09 |
| Compliance violations flagged | 0.3% | 0.4% | +0.1% |
| Conversation length (turns) | 5.8 | 6.1 | +0.3 |
The small quality gap was deemed acceptable given the 93% cost reduction. The slightly longer conversations suggested the self-hosted model occasionally required additional clarification turns, which the team addressed through prompt optimization.
Technical Challenges and Solutions
Challenge 1: Long-Tail Latency Spikes
Problem: While median TTFT was 180ms, occasional spikes reached 2-3 seconds, coinciding with bursts of requests with long prompts.
Solution: Implemented chunked prefill with a chunk size of 512 tokens. Long prompts were split into chunks interleaved with decode steps for existing requests. This capped TTFT spikes at 650ms (P99).
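In vLLM this corresponds to enabling chunked prefill and capping the number of tokens scheduled per step (the server flags are --enable-chunked-prefill and --max-num-batched-tokens; defaults vary by vLLM version). A sketch using the equivalent offline engine arguments:

```python
# Illustrative engine configuration with chunked prefill: long prompts are
# prefilled in <=512-token chunks that the scheduler interleaves with decode
# steps for already-running requests, bounding time-to-first-token spikes.
from vllm import LLM

llm = LLM(
    model="/models/llama-3-70b-instruct-awq",
    quantization="awq",
    max_model_len=4096,
    max_num_seqs=64,
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,  # per-step token budget = prefill chunk size
)
```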
Challenge 2: Memory Pressure Under High Load
Problem: At peak batch sizes (60+ concurrent requests), GPU memory occasionally hit 95%+, causing OOM errors.
Solution: Two-pronged approach:
1. Reduced max-num-seqs from 64 to 48 during peak hours (dynamic configuration).
2. Enabled KV cache INT8 quantization (halving KV cache memory usage).
Combined effect: Peak memory usage dropped to 82%, with no measurable quality impact.
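A back-of-the-envelope sketch shows why the KV cache dominates memory at this batch size, using Llama-3-70B's published shape (80 layers, 8 grouped-query KV heads, head dimension 128); actual usage is lower because PagedAttention only allocates pages for tokens actually generated:

```python
# Rough KV-cache sizing for Llama-3-70B under the deployment's settings.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
MAX_LEN, MAX_SEQS = 4096, 48

def kv_bytes_per_token(bytes_per_elem: float) -> float:
    # Keys + values, for every layer and KV head.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem

for name, width in [("FP16", 2), ("8-bit", 1)]:
    per_token = kv_bytes_per_token(width)
    worst_case = per_token * MAX_LEN * MAX_SEQS
    print(f"{name}: {per_token / 1024:.0f} KiB/token, "
          f"{worst_case / 2**30:.0f} GiB worst case for {MAX_SEQS} full-length sequences")

# FP16:  320 KiB/token, ~60 GiB worst case -> tight next to ~36 GB of weights on an 80 GB GPU
# 8-bit: 160 KiB/token, ~30 GiB worst case -> the "halving" cited above
```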
Challenge 3: Model Updates
Problem: Updating the model (e.g., applying new fine-tuning for regulatory changes) required taking nodes offline for 15 minutes to load the new model.
Solution: Implemented blue-green deployment:
1. Load the new model on a standby node while the old model continues serving.
2. Route traffic to the new node once it's ready.
3. Drain and update the old node.
This reduced downtime to near-zero with a brief period of reduced capacity.
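A sketch of the cut-over logic: the /health endpoint is the vLLM OpenAI-compatible server's readiness probe, while the switch_traffic and drain_node callbacks stand in for whatever load-balancer API is actually in use (hypothetical here).

```python
# Illustrative blue-green cut-over: wait until the standby node (loaded with the
# new model) reports healthy, then flip traffic to it and drain the old node.
import time

import requests

def wait_until_healthy(node_url: str, timeout_s: int = 1200) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{node_url}/health", timeout=2).status_code == 200:
                return
        except requests.RequestException:
            pass
        time.sleep(5)
    raise TimeoutError(f"{node_url} did not become healthy in time")

def deploy(new_node: str, old_node: str, switch_traffic, drain_node) -> None:
    # switch_traffic and drain_node are load-balancer-specific callbacks (hypothetical).
    wait_until_healthy(new_node)
    switch_traffic(new_node)  # point the load balancer at the standby node
    drain_node(old_node)      # stop new requests, let in-flight ones finish, then update
```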
Lessons Learned
- AWQ outperformed GPTQ for this use case. AWQ's quantization is less sensitive to the calibration distribution, so it generalized well to financial conversations, while GPTQ's calibration-data-dependent optimization showed slightly lower quality on out-of-distribution queries.
- Prefix caching was the highest-ROI optimization. Given that all conversations shared a 1,200-token system prompt, prefix caching effectively reduced per-request compute by 40%. This was more impactful than any other single optimization.
- INT4 quantization was the enabler, not the optimizer. The primary value of INT4 was reducing the model from 4x A100s to 1x A100 (a 4x hardware cost reduction). The throughput improvement from lower memory-bandwidth demand was a secondary benefit.
- Monitor P95 and P99 latency, not just P50. Median latency was excellent from day one, but tail latency issues only appeared under high load. The chunked prefill fix was not identified until P99 monitoring was added.
- The build vs. buy break-even shifts with volume. At 3M conversations/day, the break-even was borderline (60% savings). At 15M conversations/day, self-hosting was overwhelmingly cheaper. The team recommends starting with an API at low volume and migrating as volume grows.
Key Takeaways
- AWQ INT4 quantization enables serving 70B-parameter models on single GPUs with minimal quality loss.
- vLLM with PagedAttention and continuous batching provides production-grade serving with 2-4x throughput improvement over naive implementations.
- Prefix caching is extremely effective when requests share long system prompts.
- Self-hosting becomes economically dominant at high request volumes, with 90%+ cost savings achievable.
- Production deployment requires attention to tail latency, memory management under load, and zero-downtime update mechanisms.