Case Study 1: Raj's Document Processor — Batch Analysis with the API

Background

Raj Patel is a senior software engineer at a B2B SaaS company. The customer success team has accumulated eighteen months of customer feedback — 500+ support tickets, NPS survey responses, and customer interview transcripts — in various formats across three systems. A strategic planning initiative requires a comprehensive analysis of this feedback to identify the top product friction points and the most-requested features.

The customer success director asked for the analysis in two weeks. The team's initial estimate for manual analysis: six to eight weeks of analyst time to read, categorize, and synthesize 500+ documents.

Raj volunteered to build a solution.

The Problem Decomposition

Raj's first step was not to write code. It was to understand exactly what the analysis needed to produce. After a 45-minute meeting with the customer success director, he had clarity on the deliverable:

  1. A categorized dataset of all feedback items, with each item tagged by: category (bug report, feature request, usability complaint, praise, billing concern, integration request), product area (onboarding, dashboard, API, mobile, billing, reporting), severity (critical/high/medium/low), and sentiment.

  2. A frequency analysis: which categories and product areas appear most often?

  3. A representative sample: for each product area-category combination, five example verbatim quotes that best represent that theme.

  4. A synthesis narrative: a 1,000-word executive summary covering the top five friction points and top five feature requests, with specific evidence.

The customer success director had one critical requirement: she needed to be able to trust the analysis. "I'm going to present this to the CEO. If the categorization is wrong, the whole thing falls apart."

This requirement shaped Raj's architecture significantly.

Data Preparation

The 500+ documents were in three formats:

  - 287 Zendesk support tickets (CSV export)
  - 143 NPS survey responses (CSV export)
  - 82 customer interview transcripts (Google Docs exports as plain text)

Raj spent four hours standardizing the data into a single format:

import csv
import json
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional
import re

@dataclass
class FeedbackItem:
    """Standardized feedback item format."""
    id: str
    source: str          # zendesk | nps | interview
    date: str
    customer_id: Optional[str]
    customer_segment: Optional[str]   # enterprise | mid_market | smb
    content: str
    content_length: int

def load_zendesk_tickets(filepath: str) -> list[FeedbackItem]:
    """Load and standardize Zendesk ticket CSV."""
    items = []
    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Combine subject and description
            content = f"Subject: {row.get('subject', '')}\n\n{row.get('description', '')}"
            # Remove HTML tags if present
            content = re.sub(r'<[^>]+>', '', content).strip()

            if len(content) > 50:  # Skip trivially short items
                items.append(FeedbackItem(
                    id=f"ZD-{row.get('ticket_id', 'unknown')}",
                    source="zendesk",
                    date=row.get('created_at', ''),
                    customer_id=row.get('organization_id'),
                    customer_segment=row.get('cf_segment'),
                    content=content,
                    content_length=len(content)
                ))
    return items

def load_nps_responses(filepath: str) -> list[FeedbackItem]:
    """Load and standardize NPS survey CSV."""
    items = []
    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            score = row.get('score', '')
            comment = row.get('comment', '').strip()

            if not comment:
                continue

            content = f"NPS Score: {score}/10\n\nComment: {comment}"
            items.append(FeedbackItem(
                id=f"NPS-{row.get('response_id', 'unknown')}",
                source="nps",
                date=row.get('submitted_at', ''),
                customer_id=row.get('customer_id'),
                customer_segment=row.get('segment'),
                content=content,
                content_length=len(content)
            ))
    return items
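The third source, the interview transcripts, was plain text with no CSV structure, so it needs its own loader following the same pattern. A minimal sketch — the `interview_<number>_<date>.txt` filename convention is an assumption for illustration, not Raj's actual naming:

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class FeedbackItem:
    """Same shape as the dataclass above; repeated so this sketch stands alone."""
    id: str
    source: str
    date: str
    customer_id: Optional[str]
    customer_segment: Optional[str]
    content: str
    content_length: int

def load_interview_transcripts(dirpath: str) -> list[FeedbackItem]:
    """Load plain-text interview transcripts, one file per interview."""
    items = []
    for path in sorted(Path(dirpath).glob("*.txt")):
        content = path.read_text(encoding="utf-8").strip()
        if len(content) <= 50:  # Same short-content filter as the other loaders
            continue
        # Assumed filename convention: interview_<number>_<YYYY-MM-DD>.txt
        m = re.match(r"interview_(\d+)_(\d{4}-\d{2}-\d{2})", path.stem)
        items.append(FeedbackItem(
            id=f"INT-{m.group(1)}" if m else f"INT-{path.stem}",
            source="interview",
            date=m.group(2) if m else "",
            customer_id=None,        # Not present in the transcript exports
            customer_segment=None,
            content=content,
            content_length=len(content),
        ))
    return items
```

Because the transcripts carry no customer metadata, those fields stay None; the classification prompt only depends on content and source.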

After loading and standardizing, Raj had 497 usable feedback items (some were too short or had no meaningful content after cleaning). He saved them to a single JSON file.

The Classification Architecture

Given the customer success director's trust requirement, Raj designed a four-step pipeline: two AI steps with a human validation step between them, plus a conditional recalibration step:

Step 1: Individual item classification — Classify each of the 497 items using a structured prompt with claude-haiku (fast and economical for this mechanical task).

Step 2: Human validation sample — Present the customer success director with a random sample of 50 classified items and ask her to mark any misclassifications. This step was manual by design.

Step 3: Recalibration (if needed) — If the validation sample showed systematic errors in specific categories, revise the classification prompt and rerun those items.

Step 4: Synthesis — With validated classifications, run the frequency analysis and generate the synthesis narrative using claude-opus-4-6.

import anthropic
import json
import time
import random
from pathlib import Path
from datetime import datetime
from anthropic import RateLimitError

client = anthropic.Anthropic()

CLASSIFICATION_SYSTEM_PROMPT = """You are a product analyst at a B2B SaaS company.
You classify customer feedback with precision.
Always respond with valid JSON. Never add explanatory text outside the JSON object.
Base classifications strictly on what the feedback explicitly states."""

def classify_item(item: dict) -> dict:
    """Classify a single feedback item."""
    prompt = f"""Classify this customer feedback item.

CONTENT:
{item['content']}

SOURCE: {item['source']}

Respond with a JSON object containing exactly these fields:
{{
  "category": "bug_report|feature_request|usability_complaint|praise|billing_concern|integration_request|general_inquiry",
  "product_area": "onboarding|dashboard|api|mobile|billing|reporting|integrations|performance|other",
  "severity": "critical|high|medium|low",
  "sentiment": "positive|neutral|negative",
  "key_issue": "one sentence describing the specific issue or request",
  "notable_quote": "the most quotable 1-2 sentences from the feedback (exact text)",
  "confidence": 0.0 to 1.0
}}

Classification rules:
- category: classify by the PRIMARY intent of the feedback
- severity: critical = blocking work or causing data loss; high = significant impact; medium = notable but workaround exists; low = minor or cosmetic
- confidence: your confidence in the classification accuracy"""

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=400,
        system=CLASSIFICATION_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": prompt}]
    )

    try:
        classification = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # Parse failure — flag for manual review
        classification = {
            "category": "parse_error",
            "product_area": "unknown",
            "severity": "unknown",
            "sentiment": "unknown",
            "key_issue": "Classification parsing failed",
            "notable_quote": "",
            "confidence": 0.0
        }

    classification["tokens_used"] = (
        response.usage.input_tokens + response.usage.output_tokens
    )
    return classification


def run_classification_batch(
    items: list[dict],
    output_file: str = "classifications.json",
    checkpoint_file: str = "classify_checkpoint.json"
) -> list[dict]:
    """Run the full classification batch with checkpointing."""
    checkpoint_path = Path(checkpoint_file)
    completed_ids = set()
    results = []

    if checkpoint_path.exists():
        checkpoint = json.loads(checkpoint_path.read_text())
        completed_ids = set(checkpoint["completed_ids"])
        results = checkpoint["results"]
        print(f"Resuming: {len(completed_ids)}/{len(items)} already classified")

    remaining = [item for item in items if item["id"] not in completed_ids]
    total = len(items)

    for i, item in enumerate(remaining, 1):
        current_num = len(completed_ids) + i
        if current_num % 25 == 0:
            print(f"Progress: {current_num}/{total} items classified")

        # Retry loop with backoff
        for attempt in range(3):
            try:
                classification = classify_item(item)
                result = {
                    "id": item["id"],
                    "source": item["source"],
                    "date": item["date"],
                    "customer_segment": item.get("customer_segment"),
                    "content_preview": item["content"][:200],
                    "classification": classification
                }
                results.append(result)
                completed_ids.add(item["id"])
                break

            except RateLimitError:
                if attempt == 2:
                    # Out of retries; skip the item rather than sleep again
                    print(f"  Failed after 3 attempts: {item['id']}")
                    break
                wait = 2 ** attempt
                print(f"Rate limited at item {current_num}. Waiting {wait}s...")
                time.sleep(wait)

        # Checkpoint every 50 items
        if current_num % 50 == 0:
            checkpoint_data = {
                "completed_ids": list(completed_ids),
                "results": results,
                "last_updated": datetime.now().isoformat()
            }
            checkpoint_path.write_text(json.dumps(checkpoint_data, indent=2))

        time.sleep(0.3)  # Gentle rate limiting

    # Save final results
    Path(output_file).write_text(json.dumps(results, indent=2))

    # Clean up checkpoint
    if checkpoint_path.exists():
        checkpoint_path.unlink()

    # Report stats
    total_tokens = sum(r["classification"].get("tokens_used", 0) for r in results)
    parse_errors = sum(1 for r in results if r["classification"]["category"] == "parse_error")
    low_confidence = sum(
        1 for r in results if r["classification"].get("confidence", 1.0) < 0.7
    )

    print(f"\nClassification complete:")
    print(f"  Items classified: {len(results)}")
    print(f"  Parse errors: {parse_errors}")
    print(f"  Low confidence (<0.7): {low_confidence}")
    print(f"  Total tokens: {total_tokens:,}")
    # Rough estimate using a blended per-token rate; input and output are priced separately in practice
    print(f"  Estimated cost (Haiku): ${total_tokens * 0.0000005:.2f}")

    return results
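The parse-error and low-confidence counts are more useful as a worklist than as a statistic. A small helper — hypothetical, not part of Raj's pipeline as shown — pulls those items out so they can be routed into the manual review queue alongside the random sample:

```python
def flag_for_review(results: list[dict], threshold: float = 0.7) -> list[dict]:
    """Return classified items that need a human look:
    parse failures or classifications below the confidence threshold."""
    return [
        r for r in results
        if r["classification"]["category"] == "parse_error"
        or r["classification"].get("confidence", 1.0) < threshold
    ]
```

The 0.7 threshold mirrors the low-confidence cutoff reported in the batch stats; it is a starting point to tune, not a magic number.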

Human Validation

After running the classification batch (which took 34 minutes for 497 items), Raj extracted a random sample of 50 items and formatted them for the customer success director's review:

def generate_validation_sample(
    classified_items: list[dict],
    sample_size: int = 50,
    output_file: str = "validation_sample.csv"
) -> list[dict]:
    """Generate a validation sample for human review."""
    sample = random.sample(classified_items, min(sample_size, len(classified_items)))

    import csv
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[
            "id", "source", "content_preview",
            "ai_category", "ai_product_area", "ai_severity",
            "ai_key_issue", "ai_confidence",
            "human_category_override", "human_notes"
        ])
        writer.writeheader()
        for item in sample:
            cls = item["classification"]
            writer.writerow({
                "id": item["id"],
                "source": item["source"],
                "content_preview": item["content_preview"],
                "ai_category": cls["category"],
                "ai_product_area": cls["product_area"],
                "ai_severity": cls["severity"],
                "ai_key_issue": cls["key_issue"],
                "ai_confidence": cls["confidence"],
                "human_category_override": "",
                "human_notes": ""
            })

    print(f"Validation sample saved to {output_file}")
    print("Instructions for reviewer:")
    print("- Fill 'human_category_override' only where AI is wrong")
    print("- Add notes for any patterns you notice")
    return sample
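Once the reviewed CSV comes back, the error rate falls out mechanically. A helper along these lines — assuming the column names used above — computes it:

```python
import csv

def score_validation(filepath: str) -> float:
    """Fraction of sampled items where the reviewer overrode the AI category."""
    with open(filepath, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0.0
    overridden = sum(1 for r in rows if r["human_category_override"].strip())
    return overridden / len(rows)
```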

The customer success director reviewed the 50-item sample in 90 minutes. She found four misclassifications (8% error rate), all in the same pattern: the AI was categorizing general usability complaints about the onboarding flow as "usability_complaint" when they should be "feature_request" (because the users were implicitly requesting a better onboarding experience).

Raj updated the classification prompt with an explicit rule: "If a usability complaint implies a desired improvement, classify as feature_request rather than usability_complaint." He reran the 87 items in the onboarding product area; on re-validation, accuracy improved to 98%.
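Rerunning only the affected slice, rather than all 497 items, is what kept recalibration cheap. Selecting that slice is a one-liner; a sketch (the function name is illustrative, not from Raj's code):

```python
def select_for_rerun(results: list[dict], product_area: str) -> list[dict]:
    """Classified items in one product area, for targeted
    reclassification after a prompt revision."""
    return [
        r for r in results
        if r["classification"]["product_area"] == product_area
    ]
```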

Synthesis Generation

With validated classifications, Raj ran the synthesis phase:

def generate_synthesis(
    classified_items: list[dict],
    output_file: str = "executive_synthesis.md"
) -> str:
    """Generate executive synthesis from classified feedback."""

    # Frequency analysis
    from collections import Counter

    categories = Counter(item["classification"]["category"] for item in classified_items)
    product_areas = Counter(item["classification"]["product_area"] for item in classified_items)
    by_severity = Counter(item["classification"]["severity"] for item in classified_items)

    # Get top themes
    top_categories = categories.most_common(5)
    top_areas = product_areas.most_common(5)

    # Collect notable quotes by area
    area_quotes = {}
    for item in classified_items:
        area = item["classification"]["product_area"]
        quote = item["classification"].get("notable_quote", "")
        if quote:
            area_quotes.setdefault(area, []).append(quote)

    # Build synthesis input; precompute the joined sections for readability
    n = len(classified_items)
    cat_lines = "\n".join(
        f"  {cat}: {count} ({count / n * 100:.1f}%)" for cat, count in top_categories
    )
    area_lines = "\n".join(
        f"  {area}: {count} ({count / n * 100:.1f}%)" for area, count in top_areas
    )
    severity_lines = "\n".join(
        f"  {sev}: {count}" for sev, count in by_severity.most_common()
    )
    quote_lines = "\n".join(
        f"{area.upper()}:\n" + "\n".join(f'  "{q}"' for q in quotes[:3])
        for area, quotes in list(area_quotes.items())[:5]
    )

    frequency_summary = f"""FEEDBACK FREQUENCY ANALYSIS
Total items analyzed: {n}

By Category:
{cat_lines}

By Product Area:
{area_lines}

By Severity:
{severity_lines}

SAMPLE QUOTES BY AREA:
{quote_lines}"""

    synthesis_prompt = f"""Based on this customer feedback analysis, write an executive synthesis.

{frequency_summary}

Write a 1,000-word executive synthesis for the CEO covering:
1. Overview: what the {len(classified_items)} feedback items reveal at a high level
2. Top 5 friction points: specific, evidence-backed product problems affecting customer satisfaction
3. Top 5 feature requests: most-requested capabilities with frequency context
4. Critical items: any severity=critical issues requiring immediate attention
5. Segment patterns: do enterprise vs. SMB customers have meaningfully different concerns?
6. Recommended priorities: based purely on the data, what should the product team focus on first?

Write in clear executive prose. Be specific — cite specific features, flows, or capabilities by name.
Support every claim with data from the analysis. No filler. No vague generalizations."""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        system=(
            "You are a senior product analyst writing for a CEO. "
            "Be direct, specific, and evidence-based. "
            "Every claim must be supported by the data provided."
        ),
        messages=[{"role": "user", "content": synthesis_prompt}]
    )

    synthesis = response.content[0].text

    # Save with frequency tables
    full_output = f"# Customer Feedback Executive Synthesis\n\n"
    full_output += f"*Based on {len(classified_items)} feedback items | Generated {datetime.now().strftime('%Y-%m-%d')}*\n\n"
    full_output += "---\n\n"
    full_output += synthesis

    Path(output_file).write_text(full_output)
    print(f"Synthesis saved to {output_file}")
    print(f"Synthesis tokens: {response.usage.input_tokens + response.usage.output_tokens:,}")

    return synthesis
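One gap worth flagging: the synthesis prompt asks about enterprise-versus-SMB patterns, but frequency_summary as built contains no per-segment data, so the model has nothing to ground that section in. A segment breakdown would need to be computed and appended to the summary input; a sketch of one way to do it:

```python
from collections import Counter

def segment_breakdown(classified_items: list[dict]) -> str:
    """Per-segment category counts, formatted for the synthesis input."""
    by_segment: dict[str, Counter] = {}
    for item in classified_items:
        seg = item.get("customer_segment") or "unknown"
        by_segment.setdefault(seg, Counter())[item["classification"]["category"]] += 1
    lines = []
    for seg, counts in sorted(by_segment.items()):
        top = ", ".join(f"{cat}: {n}" for cat, n in counts.most_common(3))
        lines.append(f"  {seg}: {top}")
    return "BY SEGMENT (top categories):\n" + "\n".join(lines)
```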

Results

Total time from data standardization to delivered analysis: 3.5 days.

  - Day 1: Data standardization and cleaning (4 hours)
  - Day 1-2: Building and testing the classification pipeline (6 hours)
  - Day 2: Running classification batch (34 minutes of compute time; 2 hours of Raj's time for setup and monitoring)
  - Day 2-3: Human validation (90 minutes of customer success director's time; 2 hours of Raj's time for recalibration)
  - Day 3: Synthesis generation and final review (2 hours)

Original estimate for manual analysis: 6-8 weeks. Actual time: 3.5 days, with approximately 14 hours of total human time (Raj + director).

Total API cost: $14.73 (classification: $2.81, synthesis and recalibration: $11.92).

The customer success director presented the synthesis to the CEO on day four. The CEO requested three follow-up analyses: one filtered to enterprise customers only, one comparing the first nine months against the second to identify emerging trends, and one focused specifically on integration complaints. Raj ran all three in an afternoon — because the data was already cleaned, standardized, and classified, the additional analyses required only new synthesis generation prompts.

Lessons Learned

Build the human validation step in from the start. The 8% initial error rate was not a failure — it was expected and planned for. The validation step caught the systematic error before it propagated into the synthesis and destroyed the analysis's credibility. Raj notes: "If I hadn't built in the validation step, I would have confidently delivered wrong conclusions."

Data standardization is not a distraction. The four hours spent standardizing data into a consistent format paid off in every subsequent step. Inconsistent data formats would have made the classification prompts more complex and the results less reliable.

Cheap models for classification, expensive models for synthesis. The Haiku/Opus split was the right call. Classification is a pattern-matching task where Haiku's speed and economy are advantages. Synthesis requires the full reasoning capability of Opus. The combined cost of $14.73 would have been approximately $120 if Opus had been used for everything.

Checkpointing is not optional. The classification batch ran for 34 minutes. Raj had experienced batch jobs failing at the 80% mark before. The checkpointing implementation meant that a failure at any point would lose at most 50 items of progress.