Case Study 2: Monitoring an LLM in Production
Overview
- Company: LegalAI Inc., a legal technology startup providing an LLM-powered contract review assistant to 200 law firms.
- Challenge: The LLM application experienced intermittent quality degradation, unpredictable costs, and occasional safety incidents (generating legally inaccurate advice). The team had no systematic monitoring or evaluation framework.
- Goal: Implement comprehensive LLMOps monitoring that maintains answer quality above 90%, keeps costs within budget, and eliminates safety incidents.
Problem Analysis
After 6 months in production, several issues surfaced:
- Quality inconsistency: Users reported that answer quality varied significantly depending on the query type. Contract clause analysis was excellent, but risk assessment queries were unreliable.
- Cost spikes: Monthly API costs varied between $8K and $25K with no clear correlation to usage volume. Some queries generated unnecessarily verbose responses.
- Safety incidents: Three incidents where the system generated legally incorrect advice that could have had serious consequences if followed.
- Prompt fragility: After an upstream model update (GPT-4 to GPT-4 Turbo), several prompt templates produced worse results, but this was only discovered nearly two weeks later through user complaints.
- No evaluation pipeline: Quality was assessed solely through user feedback, which was sparse and biased toward negative experiences.
Incident Timeline
| Date | Incident | Impact | Detection |
|---|---|---|---|
| Month 2 | Hallucinated case citation | 1 user affected | User report (2 days) |
| Month 4 | Incorrect jurisdiction analysis | 15 users affected | Support tickets (5 days) |
| Month 5 | GPT-4 Turbo migration quality drop | All users affected | User complaints (12 days) |
Monitoring Architecture
Component Overview
```
User Query
    |
    v
[Input Guardrails]
    |--> PII detection and redaction
    |--> Topic classification (in-scope check)
    |--> Query complexity classification
    |
    v
[RAG Pipeline]
    |--> Retrieval monitoring (latency, relevance scores)
    |--> Context assembly monitoring (token counts)
    |
    v
[LLM Generation]
    |--> Token usage tracking
    |--> Latency monitoring
    |--> Cost computation
    |
    v
[Output Guardrails]
    |--> Legal accuracy check (citation verification)
    |--> Hallucination detection (NLI against context)
    |--> Confidence scoring
    |--> Disclaimer injection for low-confidence answers
    |
    v
[Response to User]
    |
    v
[Async Evaluation]
    |--> LLM-as-judge quality scoring
    |--> Faithfulness evaluation
    |--> Weekly human evaluation sample
```
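The reference implementation wires these stages together; as a rough illustration, the sketch below shows how per-stage latency and token usage might be recorded for every request. The `RequestTrace` class, the stand-in stage bodies, and the field names are assumptions for illustration, not the production code.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class RequestTrace:
    """Collects per-stage latency and token counts for one user query."""
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    latency_s: dict = field(default_factory=dict)   # stage name -> seconds
    tokens: dict = field(default_factory=dict)      # stage name -> token count

    @contextmanager
    def stage(self, name: str):
        """Time a pipeline stage and record its latency under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.latency_s[name] = time.perf_counter() - start

# Example: wrap each component so every request produces one trace record.
trace = RequestTrace()
with trace.stage("retrieval"):
    context_chunks = ["clause text ..."]            # stand-in for the real retriever
with trace.stage("generation"):
    answer = "The clause limits liability to ..."   # stand-in for the real LLM call
    trace.tokens["generation"] = 512                # would come from the API usage field
print(trace.request_id, trace.latency_s, trace.tokens)
```

Emitting one such record per request is what makes the per-component latency, token, and cost metrics in the dashboard below possible.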
Metrics Dashboard
| Category | Metric | Alert Threshold |
|---|---|---|
| Quality | LLM-judge score (1-5) | Mean < 3.5 |
| | Faithfulness score | < 0.85 |
| | User satisfaction (thumbs up/down) | < 75% positive |
| Safety | Hallucinated citations detected | > 0 per day |
| | Guardrail rejection rate | > 15% (may indicate system issue) |
| | Low-confidence response rate | > 20% |
| Cost | Cost per query (p50) | > $0.15 |
| | Daily spend | > $1,000 |
| | Tokens per response (p95) | > 2,000 |
| Latency | End-to-end latency (p95) | > 8 seconds |
| | Retrieval latency (p95) | > 500 ms |
| | Generation latency (p95) | > 6 seconds |
| Volume | Queries per hour | < 10 (potential outage) |
| | Error rate | > 2% |
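One way to keep these thresholds actionable is to express them as data and check each rolling-metrics snapshot against them. The sketch below mirrors the limits in the table; the metric keys and the `check_alerts` helper are illustrative, not the production schema.

```python
# Alert thresholds from the dashboard above, expressed as (comparison, limit) pairs.
ALERT_RULES = {
    "llm_judge_score_mean":       ("lt", 3.5),
    "faithfulness_score":         ("lt", 0.85),
    "user_satisfaction":          ("lt", 0.75),
    "hallucinated_citations_day": ("gt", 0),
    "guardrail_rejection_rate":   ("gt", 0.15),
    "low_confidence_rate":        ("gt", 0.20),
    "cost_per_query_p50_usd":     ("gt", 0.15),
    "daily_spend_usd":            ("gt", 1000),
    "tokens_per_response_p95":    ("gt", 2000),
    "e2e_latency_p95_s":          ("gt", 8),
    "retrieval_latency_p95_ms":   ("gt", 500),
    "generation_latency_p95_s":   ("gt", 6),
    "queries_per_hour":           ("lt", 10),
    "error_rate":                 ("gt", 0.02),
}

def check_alerts(metrics: dict) -> list[str]:
    """Return the names of metrics whose current value breaches its rule."""
    breached = []
    for name, (op, limit) in ALERT_RULES.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (op == "lt" and value < limit) or (op == "gt" and value > limit):
            breached.append(name)
    return breached

# Example: a snapshot of rolling metrics is checked once per evaluation window.
print(check_alerts({"faithfulness_score": 0.80, "daily_spend_usd": 450}))
# -> ['faithfulness_score']
```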
Implementation
Phase 1: Output Guardrails (Week 1-2)
The highest-priority fix was preventing safety incidents:
- Citation verification: Extract all legal citations from responses and verify against a curated database. Flag unrecognized citations.
- NLI-based faithfulness: Check each factual claim against retrieved context using a DeBERTa NLI model. Flag unsupported claims.
- Confidence scoring: Use the probability of the LLM's initial tokens as a confidence proxy. Low-confidence responses receive a disclaimer.
Result: Citation hallucinations dropped from ~2/week to ~0.1/week. Unsupported claims were flagged in 4.2% of responses (previously invisible).
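A condensed sketch of the two output checks described above, assuming a Hugging Face MNLI cross-encoder and an in-memory citation set; the model name, regex, and entailment threshold are illustrative choices rather than the team's exact configuration.

```python
import re
from transformers import pipeline

# Curated citation set (in production, a real legal citation index).
KNOWN_CITATIONS = {"Smith v. Jones, 500 U.S. 100 (1991)"}
CITATION_PATTERN = re.compile(r"[A-Z][\w.]+ v\. [A-Z][\w.]+, \d+ [A-Za-z.\d]+ \d+ \(\d{4}\)")

# NLI model used to test whether each claim is entailed by the retrieved context.
# The model choice is illustrative; any MNLI-style classifier would fit here.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def verify_citations(response: str) -> list[str]:
    """Return citations found in the response that are not in the curated set."""
    return [c for c in CITATION_PATTERN.findall(response) if c not in KNOWN_CITATIONS]

def unsupported_claims(claims: list[str], context: str, threshold: float = 0.5) -> list[str]:
    """Flag claims whose entailment probability against the context is too low."""
    flagged = []
    for claim in claims:
        scores = nli({"text": context, "text_pair": claim}, top_k=None)
        entail = next(s["score"] for s in scores if s["label"].startswith("ENTAIL"))
        if entail < threshold:
            flagged.append(claim)
    return flagged
```

Flagged responses would then receive the low-confidence disclaimer or be blocked outright, depending on severity.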
Phase 2: Automated Evaluation (Week 3-4)
An asynchronous evaluation pipeline was deployed:
- Every response is scored by a separate LLM judge on 4 dimensions: correctness, completeness, clarity, and citation quality (each 1-5 scale).
- A random 5% sample is queued for weekly human review.
- Scores are tracked over time with 7-day rolling averages.
- Score breakdowns by query type enable targeted improvement.
Result: Mean judge score improved from 3.2 to 4.1 over 8 weeks as the team used evaluation data to iterate on prompts and retrieval.
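A rough sketch of the judge-scoring step, assuming an OpenAI-compatible chat API with JSON-mode output; the rubric wording, model name, and sampling helper are illustrative rather than the deployed prompt.

```python
import json
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the judge model below is illustrative

JUDGE_PROMPT = """You are grading a legal contract-review assistant.
Score the answer from 1-5 on each dimension: correctness, completeness,
clarity, citation_quality. Respond with JSON only, e.g.
{{"correctness": 4, "completeness": 3, "clarity": 5, "citation_quality": 4}}.

Question: {question}
Retrieved context: {context}
Answer: {answer}"""

def judge(question: str, context: str, answer: str) -> dict:
    """Score one response on the four rubric dimensions with an LLM judge."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(completion.choices[0].message.content)

def queue_for_human_review(sample_rate: float = 0.05) -> bool:
    """Randomly select ~5% of scored responses for the weekly human review queue."""
    return random.random() < sample_rate
```

Scores are written to the metrics store alongside the request trace, so 7-day rolling averages and per-query-type breakdowns fall out of the same data.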
Phase 3: Cost Optimization (Week 5-6)
Cost monitoring surfaced three optimization opportunities:
- Verbose system prompts: The system prompt was 1,200 tokens long. A rewrite reduced it to 450 tokens without quality loss, saving 25% on input costs.
- Unnecessary retrieval: Simple questions (e.g., "What is a force majeure clause?") were retrieving 10 context chunks. A query classifier now skips retrieval for definitional queries, saving 40% on those queries.
- Response length control: Setting max_tokens appropriately for each query type reduced output tokens by 30%.
Result: Monthly costs stabilized at $9-11K (from $8-25K range), a 50% reduction in average spend.
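The routing described above might look like the sketch below, with a trivial keyword classifier standing in for the real one; the query categories, token budgets, and per-1K-token prices are illustrative.

```python
# Per-query-type generation budgets and a trivial classifier; the production
# system used a learned classifier, and these numbers are illustrative stand-ins.
MAX_TOKENS_BY_TYPE = {"definition": 300, "clause_analysis": 900, "risk_assessment": 1200}

def classify_query(query: str) -> str:
    """Route definitional questions away from retrieval; everything else retrieves."""
    q = query.lower()
    if q.startswith(("what is", "what does", "define")):
        return "definition"
    if "risk" in q:
        return "risk_assessment"
    return "clause_analysis"

def estimate_cost(prompt_tokens: int, output_tokens: int,
                  in_price: float = 0.01, out_price: float = 0.03) -> float:
    """Cost per query in USD, with illustrative per-1K-token prices."""
    return prompt_tokens / 1000 * in_price + output_tokens / 1000 * out_price

query = "What is a force majeure clause?"
qtype = classify_query(query)
needs_retrieval = qtype != "definition"      # definitional queries skip retrieval
max_tokens = MAX_TOKENS_BY_TYPE[qtype]       # cap verbosity per query type
print(qtype, needs_retrieval, max_tokens, estimate_cost(600, 250))
```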
Phase 4: Prompt Management (Week 7-8)
A prompt versioning system was deployed:
- Each prompt template is versioned with semantic versioning (major.minor.patch).
- A/B testing compares prompt variants on live traffic with automatic rollback if quality drops.
- Model migration testing: before switching to a new model version, all prompt templates are evaluated on the benchmark set.
Result: The GPT-4 Turbo migration issue would now be caught in pre-deployment testing within hours.
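A minimal sketch of a versioned prompt registry and a pre-migration benchmark gate; the registry shape, the `score_fn` callback, and the pass criterion are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str          # semantic version: major.minor.patch
    template: str

# Registry keyed by (name, version); in production this would live in a database.
REGISTRY = {
    ("contract_review", "2.1.0"): PromptVersion(
        "contract_review", "2.1.0",
        "You are a contract review assistant.\nContext: {context}\nQuestion: {question}"),
}

def evaluate_on_benchmark(prompt: PromptVersion, model: str,
                          benchmark: list[dict], score_fn) -> float:
    """Mean judge score for one prompt/model pair over the benchmark set."""
    scores = [score_fn(prompt, model, case) for case in benchmark]
    return sum(scores) / len(scores)

def safe_to_migrate(prompt: PromptVersion, old_model: str, new_model: str,
                    benchmark: list[dict], score_fn, max_drop: float = 0.1) -> bool:
    """Gate a model migration: block it if benchmark quality drops more than max_drop."""
    baseline = evaluate_on_benchmark(prompt, old_model, benchmark, score_fn)
    candidate = evaluate_on_benchmark(prompt, new_model, benchmark, score_fn)
    return candidate >= baseline - max_drop
```

Running `safe_to_migrate` for every registered template before switching model versions is what would have caught the GPT-4 Turbo regression in pre-deployment testing.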
Results
Quality Metrics
| Metric | Before | After | Target |
|---|---|---|---|
| LLM-judge score (mean) | 3.2 | 4.1 | > 3.5 |
| Faithfulness | 0.78 | 0.92 | > 0.85 |
| User satisfaction | 68% | 87% | > 80% |
| Safety incidents per quarter | 2-3 | 0 | 0 |
Operational Metrics
| Metric | Before | After |
|---|---|---|
| Monthly API cost | $8-25K | $9-11K |
| Time to detect quality issues | 5-12 days | < 4 hours |
| Prompt deployment confidence | Low (manual testing) | High (automated evaluation) |
| Citation accuracy | ~92% | ~99.5% |
Key Lessons
- Safety guardrails are non-negotiable for high-stakes domains. The NLI-based faithfulness check and citation verification should have been deployed from day one. In legal, medical, and financial applications, a single hallucination can cause real harm.
- LLM-as-judge evaluation enables continuous quality monitoring. Human evaluation cannot scale to every response, but automated LLM judges provide a useful quality signal for trend detection and regression alerts.
- Cost monitoring requires query-level granularity. Aggregate cost metrics hide the reality that different query types have vastly different cost profiles. Per-query-type tracking enabled targeted optimizations.
- Prompt versioning prevents silent regressions. When the upstream model changes, all prompts must be re-evaluated. A version control system with automated benchmarking catches issues before they reach users.
- Monitoring must cover the full pipeline. Monitoring only the LLM output misses retrieval failures, guardrail issues, and infrastructure problems. End-to-end observability requires tracking every component.
- User feedback alone is insufficient. Only 3% of users provided explicit feedback, biased toward extreme experiences. Automated evaluation provides comprehensive coverage.
Code Reference
The complete implementation is available in code/case-study-code.py.