Case Study 1: Building a Customer Support Chatbot

A Complete Chatbot with RAG, Conversation History, and Escalation

The Scenario

NovaTech is a mid-size electronics company that sells smart home devices — thermostats, security cameras, and lighting systems. Their customer support team handles 2,000 tickets per day, primarily about installation, troubleshooting, and returns. Average response time is 4 hours, and customer satisfaction scores are declining.

The engineering team is tasked with building an AI-powered support chatbot that can handle the most common customer queries instantly, escalate complex issues to human agents with full context, and reduce the average response time from 4 hours to under 30 seconds for the 60% of queries that follow well-known patterns.

This case study traces the full development process from architecture to deployment.


Step 1: Defining the Scope

The team begins by analyzing 10,000 historical support tickets to understand what customers actually ask about. They find that queries fall into five main categories:

Category                 Percentage   Complexity
Installation and setup   35%          Low to medium
Troubleshooting          25%          Medium to high
Returns and refunds      20%          Low (policy-based)
Product information      12%          Low
Account and billing      8%           High (requires system access)
The team decides to target the first four categories for AI automation, covering 92% of tickets. Account and billing queries require access to internal systems and involve sensitive financial data, so they will be immediately routed to human agents.

Design Decision

Starting with the highest-volume, lowest-complexity categories maximizes impact while minimizing risk. The chatbot does not need to handle every possible query — it needs to handle the most common queries well and gracefully escalate everything else.


Step 2: Building the Knowledge Base

The chatbot needs domain-specific knowledge. NovaTech has three sources of information:

  1. Product documentation: Installation guides, user manuals, and specification sheets for all 15 products.
  2. Support knowledge base: 500 articles written by the support team covering common issues and solutions.
  3. Return policy documents: Official policies on returns, warranties, and replacements.

The team builds a RAG pipeline to make this knowledge accessible to the chatbot.

Document preparation: Each document is cleaned to remove boilerplate headers, footers, and navigation elements. Product documentation is tagged with product name, category, and document type. Knowledge base articles are tagged with topic, product, and resolution type.

Chunking strategy: The team uses document-aware chunking that respects the natural structure of each document type:

def chunk_support_article(article: dict) -> list[dict]:
    """Chunk a support article preserving Q&A structure."""
    chunks = []
    # Keep title and problem description together
    header_chunk = f"Title: {article['title']}\n"
    header_chunk += f"Problem: {article['problem_description']}\n"
    header_chunk += f"Product: {article['product']}\n"
    header_chunk += f"Category: {article['category']}"
    chunks.append({
        "content": header_chunk,
        "metadata": {
            "source": "knowledge_base",
            "article_id": article["id"],
            "chunk_type": "header",
            "product": article["product"]
        }
    })

    # Each solution step becomes its own chunk with context
    for i, step in enumerate(article["solution_steps"]):
        step_chunk = f"Title: {article['title']}\n"
        step_chunk += f"Product: {article['product']}\n"
        step_chunk += f"Solution Step {i + 1}: {step}"
        chunks.append({
            "content": step_chunk,
            "metadata": {
                "source": "knowledge_base",
                "article_id": article["id"],
                "chunk_type": "solution_step",
                "step_number": i + 1,
                "product": article["product"]
            }
        })

    return chunks

Embedding and indexing: The team uses OpenAI's text-embedding-3-small model to generate embeddings for each chunk. They store the embeddings in pgvector (a PostgreSQL extension) since they already run PostgreSQL for their main application. This avoids adding a separate vector database to their infrastructure.

The final index contains 4,200 chunks from 515 source documents, totaling approximately 2.1 million characters of content.
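At the storage layer, the schema and search query are straightforward. The sketch below shows the shape of a pgvector setup under assumed names (a `chunks` table with a JSONB `metadata` column); the 1536-dimension column matches text-embedding-3-small's default output size, but the table and function names are illustrative, not NovaTech's actual schema:

```python
# Hypothetical SQL builders for a pgvector-backed chunk index.
# Assumes `CREATE EXTENSION vector` has already been run.

def build_index_sql(table: str = "chunks") -> str:
    """DDL for the chunk table; text-embedding-3-small emits 1536-dim vectors."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id SERIAL PRIMARY KEY, "
        "content TEXT NOT NULL, "
        "metadata JSONB, "
        "embedding vector(1536))"
    )

def build_search_sql(table: str = "chunks", top_k: int = 5) -> str:
    """Nearest-neighbor search by cosine distance (pgvector's <=> operator),
    filtered to the product identified in the conversation."""
    return (
        f"SELECT content, metadata FROM {table} "
        "WHERE metadata->>'product' = %(product)s "
        "ORDER BY embedding <=> %(query_embedding)s "
        f"LIMIT {top_k}"
    )
```

Filtering on `metadata->>'product'` is what implements the product-scoped retrieval described later in Step 4.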


Step 3: Designing the Conversation System

The chatbot's conversation system has four components:

System prompt: The team crafts a detailed persona and instruction set:

SYSTEM_PROMPT = """You are Nova, a customer support assistant for NovaTech smart home devices.

PERSONALITY:
- Warm, patient, and empathetic
- Use the customer's name when known
- Acknowledge frustration before jumping to solutions
- Keep responses concise (under 150 words) unless the customer asks for detail

CAPABILITIES:
- Answer questions about NovaTech products using the provided knowledge base
- Guide customers through installation and troubleshooting steps
- Explain return and warranty policies
- Look up order status when the customer provides an order number

RULES:
- ONLY answer questions using information from the retrieved knowledge base context
- If the knowledge base does not contain the answer, say "I don't have specific
  information about that. Let me connect you with a specialist."
- Never guess at product specifications or troubleshooting steps
- For safety-related issues (electrical, fire, water damage), IMMEDIATELY
  escalate to a human agent
- If the customer mentions they are a business/enterprise customer, escalate to
  the business support team

CITATION:
- When answering from the knowledge base, mention the relevant article or
  document (e.g., "According to the SmartThermo 3000 installation guide...")
"""

Conversation manager: The team implements a summarization-based history manager that keeps the last 10 messages verbatim and summarizes older messages. The summary preserves key facts: the customer's name, their product, the issue being discussed, and any solutions already attempted.
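A minimal sketch of that history manager, assuming the summarization itself is delegated to a cheap model call. Here the summarizer is abstracted as any callable from a message list to a summary string, and `build_history` is a hypothetical name:

```python
# Sketch of a summarization-based history manager: keep the most recent
# messages verbatim and fold everything older into a single summary message.

def build_history(messages: list[dict], summarize, keep_last: int = 10) -> list[dict]:
    """Return `messages` with all but the last `keep_last` replaced by a summary."""
    if len(messages) <= keep_last:
        return list(messages)
    older, recent = messages[:-keep_last], messages[-keep_last:]
    # The summarizer (e.g. a Haiku prompt) should preserve the customer's
    # name, their product, the issue, and solutions already attempted.
    summary = summarize(older)
    return [{"role": "user", "content": f"[Conversation summary: {summary}]"}] + recent
```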

Product context injection: When the chatbot identifies which product the customer is asking about, it automatically retrieves and injects product-specific context (key specifications, common issues, related articles) into the conversation.

Escalation engine: The system monitors the conversation for escalation triggers:

ESCALATION_TRIGGERS = {
    "explicit_request": {
        "patterns": ["talk to a human", "speak to someone", "real person",
                     "transfer me", "agent please", "supervisor"],
        "action": "immediate_handoff"
    },
    "safety_concern": {
        "patterns": ["smoke", "fire", "burning", "electrical shock",
                     "sparking", "water damage", "flooding"],
        "action": "urgent_handoff",
        "priority": "high"
    },
    "repeated_failure": {
        "condition": "user_reported_failure_count >= 3",
        "action": "offer_handoff"
    },
    "negative_sentiment": {
        "condition": "sentiment_score < -0.6 for 2 consecutive messages",
        "action": "offer_handoff"
    },
    "enterprise_customer": {
        "patterns": ["business account", "enterprise", "bulk order",
                     "company account"],
        "action": "route_to_business_support"
    }
}

Step 4: Implementing the RAG Pipeline

The retrieval pipeline runs on every user message that is not a simple greeting or follow-up acknowledgment:

  1. Query classification: A fast classifier (Claude Haiku) determines whether the message needs retrieval or can be answered from conversation context alone.
  2. Query transformation: If the user's message is vague ("it's not working"), the system combines it with conversation context to form a more specific query ("SmartThermo 3000 not connecting to WiFi after firmware update").
  3. Retrieval: The transformed query is embedded and used to search the vector database. The system retrieves the top 5 chunks, filtered by product if a product has been identified in the conversation.
  4. Re-ranking: A lightweight re-ranker scores the retrieved chunks by relevance to the specific question, not just general similarity.
  5. Generation: The re-ranked chunks are provided to Claude Sonnet as context, along with the conversation history and system prompt.

The full retrieval pipeline adds approximately 300-500 milliseconds to the response time, which is acceptable since the generation step takes 1-3 seconds regardless.
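The five steps can be sketched as a single function, with each stage injected as a callable so the model and vector-store calls can be swapped or mocked. The parameter names are illustrative, not NovaTech's actual interfaces:

```python
# Sketch of the five-step retrieval pipeline; each stage is a callable.

def answer(message: str, history: list[dict], *, needs_retrieval, transform,
           retrieve, rerank, generate, top_k: int = 5) -> str:
    if not needs_retrieval(message, history):          # 1. query classification
        return generate(message, history, context=[])
    query = transform(message, history)                # 2. query transformation
    chunks = retrieve(query, top_k=top_k)              # 3. vector search (product-filtered)
    chunks = rerank(query, chunks)                     # 4. relevance re-ranking
    return generate(message, history, context=chunks)  # 5. grounded generation
```

In production the classifier and transformer would be Haiku calls, `retrieve` a pgvector query, and `generate` a Sonnet call with the system prompt and conversation history attached.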


Step 5: Quality Assurance

Before launch, the team builds a comprehensive evaluation framework.

Test dataset: They create 200 test questions from real customer interactions, each with a "gold standard" answer reviewed by senior support agents. The test set covers all product categories and difficulty levels.

Automated evaluation: Each test is evaluated on four dimensions:

  • Accuracy: Does the answer match the gold standard on key facts? Measured by AI-based comparison.
  • Groundedness: Is every claim in the answer supported by the retrieved documents? Measured by the groundedness checker from Section 39.7.
  • Helpfulness: Does the answer actually address the customer's question? Measured by AI scoring on a 1-5 scale.
  • Safety: Does the answer avoid harmful advice, especially for installation and troubleshooting? Checked by rule-based filters and AI safety evaluation.
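The scoring loop itself is simple once the AI judges are abstracted as callables, each taking a chatbot response and a test case and returning a score for one dimension. A hypothetical sketch:

```python
# Sketch of the evaluation harness: run every test case through the chatbot
# and average each judge's scores. Judge names and signatures are assumptions.

def evaluate(test_cases: list[dict], answer_fn, judges: dict) -> dict:
    """Return the mean score per dimension across all test cases."""
    totals = {name: 0.0 for name in judges}
    for case in test_cases:
        response = answer_fn(case["question"])
        for name, judge in judges.items():
            totals[name] += judge(response, case)
    n = len(test_cases)
    return {name: total / n for name, total in totals.items()}
```

In practice each judge would be an AI-based comparison against the gold-standard answer (accuracy), the groundedness checker, a 1-5 helpfulness rubric, or the safety filters.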

Baseline results: On the initial test, the chatbot scores:

  • Accuracy: 87% (174/200 questions answered correctly)
  • Groundedness: 94% (188/200 answers fully supported by sources)
  • Helpfulness: 4.2/5.0 average
  • Safety: 100% (no unsafe recommendations detected)

The team identifies the 26 failing questions and categorizes the failures:

  • 12 failures: Insufficient information in the knowledge base (missing articles)
  • 8 failures: Poor chunking that split key information across chunks
  • 4 failures: Ambiguous queries where the chatbot chose the wrong product
  • 2 failures: Hallucinated troubleshooting steps not in any document

For the 12 missing-information failures, they write new knowledge base articles. For the 8 chunking failures, they adjust their chunking strategy to keep related information together. For the 4 ambiguity failures, they implement a clarification flow where the chatbot asks which product the customer is referring to. For the 2 hallucination failures, they strengthen the system prompt's instruction to only use information from retrieved documents.

After these fixes, the scores improve to: Accuracy 95%, Groundedness 98%, Helpfulness 4.5/5, Safety 100%.


Step 6: Cost Modeling

The team estimates costs for their expected volume:

  • Daily volume: 2,000 conversations, average 6 turns each = 12,000 AI API calls per day.
  • Token usage per call: ~1,500 input tokens (system prompt + context + history) + ~200 output tokens (concise responses).
  • Model: Claude 3.5 Sonnet for generation, Claude 3.5 Haiku for classification and query transformation.

Monthly cost estimate:

  • Sonnet calls (generation): 360,000 calls/month x 1,500 input tokens x ($3.00/1M) + 360,000 x 200 output tokens x ($15.00/1M) = $1,620 + $1,080 = $2,700
  • Haiku calls (classification): 360,000 calls/month x 500 input tokens x ($0.80/1M) + 360,000 x 50 output tokens x ($4.00/1M) = $144 + $72 = $216
  • Embedding calls: 360,000 queries x $0.02/1K tokens = approximately $108
  • Total estimated monthly cost: approximately $3,024
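The per-model arithmetic above reduces to one small helper. Prices are per million tokens and taken from the estimate in the text, so they should be updated whenever pricing changes:

```python
# Cost helper for the monthly estimate; prices are dollars per 1M tokens.

def monthly_cost(calls: int, in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of `calls` requests at the given per-1M-token prices."""
    return calls * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

sonnet = monthly_cost(360_000, 1_500, 200, 3.00, 15.00)  # generation: $2,700
haiku = monthly_cost(360_000, 500, 50, 0.80, 4.00)       # classification: $216
```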

Compared to the cost of human agents handling these same 2,000 daily conversations (estimated at $45,000/month in agent salaries for the equivalent capacity), the AI chatbot represents a 93% cost reduction for the queries it can handle.

The team implements caching for common queries (estimated 20% cache hit rate) and uses prompt caching for the system prompt, further reducing costs by approximately $500/month.


Step 7: Deployment and Monitoring

The chatbot is deployed behind a feature flag, initially available to 10% of customers (200 conversations per day). The team monitors:

  • Resolution rate: What percentage of conversations are resolved without human escalation? Target: 60%.
  • Escalation rate: What percentage are escalated? Acceptable: up to 40%.
  • Customer satisfaction: Post-conversation survey scores. Target: 4.0/5.0 or higher.
  • Response time: Time to first response. Target: under 3 seconds.
  • Cost per conversation: Actual vs. estimated. Budget: under $0.10 per conversation.
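A daily roll-up of these metrics can be computed from per-conversation records. The field names below are assumptions about what the logging layer captures, not NovaTech's actual schema:

```python
# Hypothetical daily metrics roll-up over per-conversation log records.

def summarize_day(conversations: list[dict]) -> dict:
    """Aggregate resolution, latency, and cost metrics for one day of traffic."""
    n = len(conversations)
    resolved = sum(1 for c in conversations if not c["escalated"])
    return {
        "resolution_rate": resolved / n,
        "escalation_rate": 1 - resolved / n,
        # Median time-to-first-response (upper middle value for even n).
        "median_response_s": sorted(c["first_response_s"] for c in conversations)[n // 2],
        "cost_per_conversation": sum(c["cost"] for c in conversations) / n,
    }
```

Each of these values feeds the dashboard targets listed above, so a regression in any one of them can trigger an alert.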

Week 1 results (200 conversations/day):

  • Resolution rate: 58%
  • Escalation rate: 42%
  • Customer satisfaction: 4.1/5.0
  • Response time: 2.1 seconds (median)
  • Cost per conversation: $0.08

The team identifies that many escalations come from customers who need to check order status — a feature the chatbot can handle but is not yet connected to the order management system. After integrating the order lookup tool, resolution rate improves to 67%.

Week 4 results (full rollout, 2,000 conversations/day):

  • Resolution rate: 71%
  • Escalation rate: 29%
  • Customer satisfaction: 4.3/5.0
  • Response time: 2.4 seconds (median)
  • Cost per conversation: $0.07 (caching reduces costs at scale)


Lessons Learned

  1. Data quality drives chatbot quality. The biggest improvements came not from prompt engineering but from improving the knowledge base — filling gaps, fixing chunking, and ensuring information was accurate and complete.

  2. Escalation is a feature, not a failure. Customers appreciate a chatbot that knows its limits and quickly connects them to a human. The worst experience is a chatbot that keeps trying to help when it clearly cannot.

  3. Start narrow, then expand. Launching with a subset of query types and a subset of traffic allowed the team to iterate safely. They caught and fixed issues at 200 conversations per day instead of 2,000.

  4. Monitor relentlessly. The quality dashboard caught a regression in Week 3 when a knowledge base update accidentally deleted several articles. Automated monitoring detected the accuracy drop within hours, and the articles were restored the same day.

  5. Cost modeling must be continuous. Initial estimates were accurate, but changes in conversation length (users had longer conversations than expected) and the addition of new features (order lookup) changed the cost profile. Monthly cost reviews are essential.


The complete implementation code for this case study is available in code/case-study-code.py.