Case Study 2: Monitoring an LLM in Production
Overview
- Company: LegalAI Inc., a legal technology startup providing an LLM-powered contract review assistant to 200 law firms.
- Challenge: The LLM application experienced intermittent quality degradation, unpredictable costs, and occasional safety incidents (generating legally inaccurate advice). The team had no systematic monitoring or evaluation framework.
- Goal: Implement comprehensive LLMOps monitoring that maintains answer quality above 90%, keeps costs within budget, and eliminates safety incidents.
Problem Analysis
After 6 months in production, several issues surfaced:
- Quality inconsistency: Users reported that answer quality varied significantly depending on the query type. Contract clause analysis was excellent, but risk assessment queries were unreliable.
- Cost spikes: Monthly API costs varied between $8K and $25K with no clear correlation to usage volume. Some queries generated unnecessarily verbose responses.
- Safety incidents: Three incidents where the system generated legally incorrect advice that could have had serious consequences if followed.
- Prompt fragility: After an upstream model update (GPT-4 to GPT-4 Turbo), several prompt templates produced worse results, but this was only discovered nearly two weeks later through user complaints.
- No evaluation pipeline: Quality was assessed solely through user feedback, which was sparse and biased toward negative experiences.
Incident Timeline
| Date | Incident | Impact | Detection |
|---|---|---|---|
| Month 2 | Hallucinated case citation | 1 user affected | User report (2 days) |
| Month 4 | Incorrect jurisdiction analysis | 15 users affected | Support tickets (5 days) |
| Month 5 | GPT-4 Turbo migration quality drop | All users affected | User complaints (12 days) |
Monitoring Architecture
Component Overview
```
User Query
    |
    v
[Input Guardrails]
    |--> PII detection and redaction
    |--> Topic classification (in-scope check)
    |--> Query complexity classification
    |
    v
[RAG Pipeline]
    |--> Retrieval monitoring (latency, relevance scores)
    |--> Context assembly monitoring (token counts)
    |
    v
[LLM Generation]
    |--> Token usage tracking
    |--> Latency monitoring
    |--> Cost computation
    |
    v
[Output Guardrails]
    |--> Legal accuracy check (citation verification)
    |--> Hallucination detection (NLI against context)
    |--> Confidence scoring
    |--> Disclaimer injection for low-confidence answers
    |
    v
[Response to User]
    |
    v
[Async Evaluation]
    |--> LLM-as-judge quality scoring
    |--> Faithfulness evaluation
    |--> Weekly human evaluation sample
```
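The reference implementation wires these stages together; as a rough illustration, the sketch below shows how per-stage latency and token usage might be recorded for every request. The `RequestTrace` class, the stand-in stage bodies, and the field names are assumptions for illustration, not the production code.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class RequestTrace:
    """Collects per-stage latency and token counts for one user query."""
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    latency_s: dict = field(default_factory=dict)   # stage name -> seconds
    tokens: dict = field(default_factory=dict)      # stage name -> token count

    @contextmanager
    def stage(self, name: str):
        """Time a pipeline stage and record its latency under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.latency_s[name] = time.perf_counter() - start

# Example: wrap each component so every request produces one trace record.
trace = RequestTrace()
with trace.stage("retrieval"):
    context_chunks = ["clause text ..."]            # stand-in for the real retriever
with trace.stage("generation"):
    answer = "The clause limits liability to ..."   # stand-in for the real LLM call
    trace.tokens["generation"] = 512                # would come from the API usage field
print(trace.request_id, trace.latency_s, trace.tokens)
```

Emitting one such record per request is what makes the per-component latency, token, and cost metrics in the dashboard below possible.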
Metrics Dashboard
| Category | Metric | Alert Threshold |
|---|---|---|
| Quality | LLM-judge score (1-5) | Mean < 3.5 |
| | Faithfulness score | < 0.85 |
| | User satisfaction (thumbs up/down) | < 75% positive |
| Safety | Hallucinated citations detected | > 0 per day |
| | Guardrail rejection rate | > 15% (may indicate system issue) |
| | Low-confidence response rate | > 20% |
| Cost | Cost per query (p50) | > $0.15 |
| | Daily spend | > $1,000 |
| | Tokens per response (p95) | > 2,000 |
| Latency | End-to-end latency (p95) | > 8 seconds |
| | Retrieval latency (p95) | > 500 ms |
| | Generation latency (p95) | > 6 seconds |
| Volume | Queries per hour | < 10 (potential outage) |
| | Error rate | > 2% |
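One way to keep these thresholds actionable is to express them as data and check each rolling-metrics snapshot against them. The sketch below mirrors the limits in the table; the metric keys and the `check_alerts` helper are illustrative, not the production schema.

```python
# Alert thresholds from the dashboard above, expressed as (comparison, limit) pairs.
ALERT_RULES = {
    "llm_judge_score_mean":       ("lt", 3.5),
    "faithfulness_score":         ("lt", 0.85),
    "user_satisfaction":          ("lt", 0.75),
    "hallucinated_citations_day": ("gt", 0),
    "guardrail_rejection_rate":   ("gt", 0.15),
    "low_confidence_rate":        ("gt", 0.20),
    "cost_per_query_p50_usd":     ("gt", 0.15),
    "daily_spend_usd":            ("gt", 1000),
    "tokens_per_response_p95":    ("gt", 2000),
    "e2e_latency_p95_s":          ("gt", 8),
    "retrieval_latency_p95_ms":   ("gt", 500),
    "generation_latency_p95_s":   ("gt", 6),
    "queries_per_hour":           ("lt", 10),
    "error_rate":                 ("gt", 0.02),
}

def check_alerts(metrics: dict) -> list[str]:
    """Return the names of metrics whose current value breaches its rule."""
    breached = []
    for name, (op, limit) in ALERT_RULES.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (op == "lt" and value < limit) or (op == "gt" and value > limit):
            breached.append(name)
    return breached

# Example: a snapshot of rolling metrics is checked once per evaluation window.
print(check_alerts({"faithfulness_score": 0.80, "daily_spend_usd": 450}))
# -> ['faithfulness_score']
```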
Implementation
Phase 1: Output Guardrails (Week 1-2)
The highest-priority fix was preventing safety incidents:
- Citation verification: Extract all legal citations from responses and verify against a curated database. Flag unrecognized citations.
- NLI-based faithfulness: Check each factual claim against retrieved context using a DeBERTa NLI model. Flag unsupported claims.
- Confidence scoring: Use the probability of the LLM's initial tokens as a confidence proxy. Low-confidence responses receive a disclaimer.
Result: Citation hallucinations dropped from ~2/week to ~0.1/week. Unsupported claims were flagged in 4.2% of responses (previously invisible).
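A condensed sketch of the two output checks described above, assuming a Hugging Face MNLI cross-encoder and an in-memory citation set; the model name, regex, and entailment threshold are illustrative choices rather than the team's exact configuration.

```python
import re
from transformers import pipeline

# Curated citation set (in production, a real legal citation index).
KNOWN_CITATIONS = {"Smith v. Jones, 500 U.S. 100 (1991)"}
CITATION_PATTERN = re.compile(r"[A-Z][\w.]+ v\. [A-Z][\w.]+, \d+ [A-Za-z.\d]+ \d+ \(\d{4}\)")

# NLI model used to test whether each claim is entailed by the retrieved context.
# The model choice is illustrative; any MNLI-style classifier would fit here.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def verify_citations(response: str) -> list[str]:
    """Return citations found in the response that are not in the curated set."""
    return [c for c in CITATION_PATTERN.findall(response) if c not in KNOWN_CITATIONS]

def unsupported_claims(claims: list[str], context: str, threshold: float = 0.5) -> list[str]:
    """Flag claims whose entailment probability against the context is too low."""
    flagged = []
    for claim in claims:
        scores = nli({"text": context, "text_pair": claim}, top_k=None)
        entail = next(s["score"] for s in scores if s["label"].startswith("ENTAIL"))
        if entail < threshold:
            flagged.append(claim)
    return flagged
```

Flagged responses would then receive the low-confidence disclaimer or be blocked outright, depending on severity.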
Phase 2: Automated Evaluation (Week 3-4)
An asynchronous evaluation pipeline was deployed:
- Every response is scored by a separate LLM judge on 4 dimensions: correctness, completeness, clarity, and citation quality (each 1-5 scale).
- A random 5% sample is queued for weekly human review.
- Scores are tracked over time with 7-day rolling averages.
- Score breakdowns by query type enable targeted improvement.
Result: Mean judge score improved from 3.2 to 4.1 over 8 weeks as the team used evaluation data to iterate on prompts and retrieval.
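A rough sketch of the judge-scoring step, assuming an OpenAI-compatible chat API with JSON-mode output; the rubric wording, model name, and sampling helper are illustrative rather than the deployed prompt.

```python
import json
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the judge model below is illustrative

JUDGE_PROMPT = """You are grading a legal contract-review assistant.
Score the answer from 1-5 on each dimension: correctness, completeness,
clarity, citation_quality. Respond with JSON only, e.g.
{{"correctness": 4, "completeness": 3, "clarity": 5, "citation_quality": 4}}.

Question: {question}
Retrieved context: {context}
Answer: {answer}"""

def judge(question: str, context: str, answer: str) -> dict:
    """Score one response on the four rubric dimensions with an LLM judge."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(completion.choices[0].message.content)

def queue_for_human_review(sample_rate: float = 0.05) -> bool:
    """Randomly select ~5% of scored responses for the weekly human review queue."""
    return random.random() < sample_rate
```

Scores are written to the metrics store alongside the request trace, so 7-day rolling averages and per-query-type breakdowns fall out of the same data.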
Phase 3: Cost Optimization (Week 5-6)
Cost monitoring surfaced three optimization opportunities:
- Verbose system prompts: The system prompt was 1,200 tokens long. A rewrite reduced it to 450 tokens without quality loss, saving 25% on input costs.
- Unnecessary retrieval: Simple questions (e.g., "What is a force majeure clause?") were retrieving 10 context chunks. A query classifier now skips retrieval for definitional queries, saving 40% on those queries.
- Response length control: Setting max_tokens appropriately for each query type reduced output tokens by 30%.
Result: Monthly costs stabilized at $9-11K (from $8-25K range), a 50% reduction in average spend.
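The routing described above might look like the sketch below, with a trivial keyword classifier standing in for the real one; the query categories, token budgets, and per-1K-token prices are illustrative.

```python
# Per-query-type generation budgets and a trivial classifier; the production
# system used a learned classifier, and these numbers are illustrative stand-ins.
MAX_TOKENS_BY_TYPE = {"definition": 300, "clause_analysis": 900, "risk_assessment": 1200}

def classify_query(query: str) -> str:
    """Route definitional questions away from retrieval; everything else retrieves."""
    q = query.lower()
    if q.startswith(("what is", "what does", "define")):
        return "definition"
    if "risk" in q:
        return "risk_assessment"
    return "clause_analysis"

def estimate_cost(prompt_tokens: int, output_tokens: int,
                  in_price: float = 0.01, out_price: float = 0.03) -> float:
    """Cost per query in USD, with illustrative per-1K-token prices."""
    return prompt_tokens / 1000 * in_price + output_tokens / 1000 * out_price

query = "What is a force majeure clause?"
qtype = classify_query(query)
needs_retrieval = qtype != "definition"      # definitional queries skip retrieval
max_tokens = MAX_TOKENS_BY_TYPE[qtype]       # cap verbosity per query type
print(qtype, needs_retrieval, max_tokens, estimate_cost(600, 250))
```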
Phase 4: Prompt Management (Week 7-8)
A prompt versioning system was deployed:
- Each prompt template is versioned with semantic versioning (major.minor.patch).
- A/B testing compares prompt variants on live traffic with automatic rollback if quality drops.
- Model migration testing: before switching to a new model version, all prompt templates are evaluated on the benchmark set.
Result: The GPT-4 Turbo migration issue would now be caught in pre-deployment testing within hours.
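A minimal sketch of a versioned prompt registry and a pre-migration benchmark gate; the registry shape, the `score_fn` callback, and the pass criterion are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str          # semantic version: major.minor.patch
    template: str

# Registry keyed by (name, version); in production this would live in a database.
REGISTRY = {
    ("contract_review", "2.1.0"): PromptVersion(
        "contract_review", "2.1.0",
        "You are a contract review assistant.\nContext: {context}\nQuestion: {question}"),
}

def evaluate_on_benchmark(prompt: PromptVersion, model: str,
                          benchmark: list[dict], score_fn) -> float:
    """Mean judge score for one prompt/model pair over the benchmark set."""
    scores = [score_fn(prompt, model, case) for case in benchmark]
    return sum(scores) / len(scores)

def safe_to_migrate(prompt: PromptVersion, old_model: str, new_model: str,
                    benchmark: list[dict], score_fn, max_drop: float = 0.1) -> bool:
    """Gate a model migration: block it if benchmark quality drops more than max_drop."""
    baseline = evaluate_on_benchmark(prompt, old_model, benchmark, score_fn)
    candidate = evaluate_on_benchmark(prompt, new_model, benchmark, score_fn)
    return candidate >= baseline - max_drop
```

Running `safe_to_migrate` for every registered template before switching model versions is what would have caught the GPT-4 Turbo regression in pre-deployment testing.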
Results
Quality Metrics
| Metric | Before | After | Target |
|---|---|---|---|
| LLM-judge score (mean) | 3.2 | 4.1 | > 3.5 |
| Faithfulness | 0.78 | 0.92 | > 0.85 |
| User satisfaction | 68% | 87% | > 80% |
| Safety incidents per quarter | 2-3 | 0 | 0 |
Operational Metrics
| Metric | Before | After |
|---|---|---|
| Monthly API cost | $8-25K | $9-11K |
| Time to detect quality issues | 5-12 days | < 4 hours |
| Prompt deployment confidence | Low (manual testing) | High (automated evaluation) |
| Citation accuracy | ~92% | ~99.5% |
Key Lessons
- Safety guardrails are non-negotiable for high-stakes domains. The NLI-based faithfulness check and citation verification should have been deployed from day one. In legal, medical, and financial applications, a single hallucination can cause real harm.
- LLM-as-judge evaluation enables continuous quality monitoring. Human evaluation cannot scale to every response, but automated LLM judges provide a useful quality signal for trend detection and regression alerts.
- Cost monitoring requires query-level granularity. Aggregate cost metrics hide the reality that different query types have vastly different cost profiles. Per-query-type tracking enabled targeted optimizations.
- Prompt versioning prevents silent regressions. When the upstream model changes, all prompts must be re-evaluated. A version control system with automated benchmarking catches issues before they reach users.
- Monitoring must cover the full pipeline. Monitoring only the LLM output misses retrieval failures, guardrail issues, and infrastructure problems. End-to-end observability requires tracking every component.
- User feedback alone is insufficient. Only 3% of users provided explicit feedback, biased toward extreme experiences. Automated evaluation provides comprehensive coverage.
Code Reference
The complete implementation is available in code/case-study-code.py.