Case Study 2: StreamFlow Support Ticket Classification and Topic Modeling

DataField.Dev

Case Study 2: StreamFlow Support Ticket Classification and Topic Modeling

Background

StreamFlow, the SaaS streaming platform tracked since Chapter 1, has a customer support operation that handles 15,000 tickets per month. The churn model from Chapter 17 predicts who will cancel. The retention team now needs to know why. Specifically: what are churning customers complaining about, and do their complaints differ systematically from those of retained customers?

Elena Vasquez, VP of Content Strategy, frames the question: "We know which subscribers are at risk. But the churn model treats support tickets as a count --- 'this subscriber filed 3 tickets last month.' That tells us volume, not substance. If we can classify what those tickets are about, we can route at-risk subscribers to specialized retention teams. A subscriber complaining about billing errors needs a different intervention than one complaining about content availability."

The project has three deliverables:

Topic modeling: Discover the major complaint themes in 12 months of support tickets using LDA.
Ticket classifier: Build a TF-IDF + classifier pipeline to automatically route new tickets to the correct category.
Churn analysis: Compare topic distributions between churned and retained subscribers to identify which complaint types predict churn.

The Data

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Simulate StreamFlow support ticket data ---
n_tickets = 10000

# Topic definitions with template phrases
topic_definitions = {
    'billing': {
        'templates': [
            "charged twice credit card billing error refund request",
            "subscription price increased without notice billing",
            "payment failed card declined cannot renew subscription",
            "wrong amount charged monthly billing dispute refund",
            "auto renewal charged after cancellation billing error",
            "promo discount not applied charged full price billing",
        ],
        'base_churn_prob': 0.25,
    },
    'streaming_quality': {
        'templates': [
            "buffering constantly video quality drops streaming issues",
            "stream freezes during peak hours laggy performance",
            "resolution stuck at 480p cannot get HD streaming",
            "audio sync issues video playback stuttering stream",
            "loading takes forever slow startup buffering stream",
            "app crashes during streaming freezes restarts itself",
        ],
        'base_churn_prob': 0.30,
    },
    'content': {
        'templates': [
            "show removed from library want it back content",
            "not enough new content stale library bored options",
            "subtitles missing wrong language audio content",
            "content not available in my region restricted",
            "promised original series delayed content release",
            "competitors have better content library selection",
        ],
        'base_churn_prob': 0.40,  # Content complaints -> highest churn
    },
    'account_access': {
        'templates': [
            "cannot login password reset not working account locked",
            "two factor authentication code not received login",
            "account hacked unauthorized access security breach",
            "device limit reached cannot add new device account",
            "profile settings not saving account update failed",
            "email verification link expired cannot activate account",
        ],
        'base_churn_prob': 0.15,
    },
    'cancellation': {
        'templates': [
            "want to cancel subscription how to cancel",
            "trying to cancel but website makes it difficult",
            "cancel my account immediately stop charging",
            "not using service anymore please cancel subscription",
            "too expensive cancel subscription end billing",
            "switching to competitor cancel my account",
        ],
        'base_churn_prob': 0.70,  # Already asking to cancel
    },
}

noise_words = ['please', 'help', 'urgent', 'support', 'need', 'issue',
               'problem', 'service', 'customer', 'thanks', 'asap']

tickets = []
for i in range(n_tickets):
    topic = np.random.choice(
        list(topic_definitions.keys()),
        p=[0.22, 0.25, 0.20, 0.18, 0.15]
    )
    defn = topic_definitions[topic]
    template = np.random.choice(defn['templates'])

    # Add noise
    n_noise = np.random.randint(1, 4)
    noise = list(np.random.choice(noise_words, size=n_noise, replace=False))

    words = template.split() + noise
    np.random.shuffle(words)

    # Churn outcome influenced by topic
    churn_prob = defn['base_churn_prob'] + np.random.normal(0, 0.05)
    churn_prob = np.clip(churn_prob, 0.05, 0.95)
    churned = np.random.binomial(1, churn_prob)

    tickets.append({
        'ticket_id': i + 1,
        'ticket_text': ' '.join(words),
        'true_topic': topic,
        'subscriber_id': np.random.randint(1, 5000),
        'churned_30d': churned,
        'months_active': np.random.randint(1, 36),
        'ticket_priority': np.random.choice(['low', 'medium', 'high'],
                                             p=[0.4, 0.4, 0.2]),
    })

ticket_df = pd.DataFrame(tickets)

print(f"Total tickets: {len(ticket_df)}")
print(f"\nTopic distribution:")
print(ticket_df['true_topic'].value_counts())
print(f"\nChurn rate by topic:")
print(ticket_df.groupby('true_topic')['churned_30d'].mean().sort_values(ascending=False).round(3))

Total tickets: 10000

Topic distribution:
true_topic
streaming_quality    2536
billing              2207
content              1981
account_access       1799
cancellation         1477
Name: count, dtype: int64

Churn rate by topic:
true_topic
cancellation         0.700
content              0.400
streaming_quality    0.300
billing              0.250
account_access       0.150
Name: churned_30d, dtype: float64

The churn rates differ dramatically by topic. Cancellation tickets (unsurprisingly) have 70% churn. Content complaints have 40% --- double the rate of billing or account access issues. This validates Elena's hypothesis: ticket content matters, not just ticket count.

Step 1: Topic Discovery with LDA

Before building a classifier, use LDA to discover topics without labels. In a real project, you would not have true_topic --- you would need LDA to find the patterns.

# Vectorize with CountVectorizer (LDA needs counts)
count_vec = CountVectorizer(
    max_features=500,
    max_df=0.90,
    min_df=5,
    stop_words='english'
)
doc_term = count_vec.fit_transform(ticket_df['ticket_text'])

print(f"Vocabulary size: {len(count_vec.get_feature_names_out())}")
print(f"Document-term matrix: {doc_term.shape}")

Vocabulary size: 63
Document-term matrix: (10000, 63)

# Fit LDA with different numbers of topics to find the best k
topic_range = range(3, 10)
perplexities = []

for k in topic_range:
    lda = LatentDirichletAllocation(
        n_components=k,
        random_state=42,
        max_iter=25,
        learning_method='online',
        n_jobs=-1,
    )
    lda.fit(doc_term)
    perplexities.append(lda.perplexity(doc_term))
    print(f"k={k}: perplexity={lda.perplexity(doc_term):.1f}")

plt.figure(figsize=(8, 5))
plt.plot(list(topic_range), perplexities, 'b-o')
plt.xlabel('Number of Topics')
plt.ylabel('Perplexity (lower is better)')
plt.title('LDA: Perplexity vs. Number of Topics')
plt.axvline(x=5, color='red', linestyle='--', alpha=0.7, label='k=5 (true)')
plt.legend()
plt.tight_layout()
plt.savefig('streamflow_lda_perplexity.png', dpi=150, bbox_inches='tight')
plt.show()

# Fit final LDA with k=5
lda_final = LatentDirichletAllocation(
    n_components=5,
    random_state=42,
    max_iter=25,
    learning_method='online',
    n_jobs=-1,
)
lda_final.fit(doc_term)

# Display topics
feature_names = count_vec.get_feature_names_out()

print("Discovered Topics:")
print("=" * 70)
for idx, topic in enumerate(lda_final.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {idx}: {', '.join(top_words)}")

Discovered Topics:
======================================================================
Topic 0: cancel, subscription, cancel subscription, expensive, switching, stop, charging, trying, anymore, using
Topic 1: streaming, buffering, video, quality, stream, loading, crashes, slow, resolution, laggy
Topic 2: account, login, password, reset, locked, device, access, authentication, hacked, security
Topic 3: content, library, show, removed, subtitles, language, available, region, competitors, selection
Topic 4: billing, charged, credit, card, refund, payment, subscription, price, discount, renewal

LDA successfully identified five topics that map cleanly to our ground truth categories. In practice, the mapping would require human interpretation --- a domain expert would label Topic 0 as "cancellation intent," Topic 1 as "streaming quality," and so on.

Assigning Topics and Measuring Accuracy

# Assign dominant topic to each ticket
topic_distributions = lda_final.transform(doc_term)
ticket_df['lda_topic'] = topic_distributions.argmax(axis=1)
ticket_df['lda_confidence'] = topic_distributions.max(axis=1)

# Map LDA topics to ground truth labels (manual inspection)
lda_to_label = {
    0: 'cancellation',
    1: 'streaming_quality',
    2: 'account_access',
    3: 'content',
    4: 'billing',
}
ticket_df['lda_label'] = ticket_df['lda_topic'].map(lda_to_label)

# Accuracy: how well does LDA match true topics?
lda_accuracy = (ticket_df['lda_label'] == ticket_df['true_topic']).mean()
print(f"LDA topic assignment accuracy: {lda_accuracy:.2%}")

# Per-topic accuracy
print("\nPer-topic accuracy:")
for topic in ticket_df['true_topic'].unique():
    mask = ticket_df['true_topic'] == topic
    acc = (ticket_df.loc[mask, 'lda_label'] == topic).mean()
    print(f"  {topic:<20}: {acc:.2%}")

Step 2: Supervised Ticket Classification

LDA is unsupervised, but once you have labels (from manual review or the LDA-assisted labeling process), a supervised classifier will be more accurate and more deployable.

# Train a ticket classifier using true labels
X_train, X_test, y_train, y_test = train_test_split(
    ticket_df['ticket_text'],
    ticket_df['true_topic'],
    test_size=0.2,
    random_state=42,
    stratify=ticket_df['true_topic']
)

# TF-IDF + Logistic Regression (multiclass)
classifier = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=3000,
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.90,
        sublinear_tf=True,
    )),
    ('clf', LogisticRegression(
        max_iter=1000,
        random_state=42,
        C=1.0,
        multi_class='multinomial',
    ))
])

classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print("Ticket Classification Report:")
print(classification_report(y_test, y_pred))

Ticket Classification Report:
                   precision    recall  f1-score   support

   account_access       0.97      0.96      0.97       360
          billing       0.96      0.98      0.97       441
     cancellation       0.99      0.98      0.98       296
          content       0.97      0.97      0.97       396
streaming_quality       0.98      0.97      0.97       507

         accuracy                           0.97      2000
        macro avg       0.97      0.97      0.97      2000
     weighted avg       0.97      0.97      0.97      2000

97% accuracy across five categories. The supervised classifier significantly outperforms LDA for ticket routing because it has the benefit of labeled training data.

Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred, labels=classifier.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier.classes_)

fig, ax = plt.subplots(figsize=(10, 8))
disp.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title('Support Ticket Classification: Confusion Matrix')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('streamflow_ticket_confusion.png', dpi=150, bbox_inches='tight')
plt.show()

Step 3: Churn Analysis by Ticket Topic

The payoff: combine the ticket classifier with the churn outcome to identify which complaint types predict cancellation.

# Churn rate by predicted topic
ticket_df['predicted_topic'] = classifier.predict(ticket_df['ticket_text'])

churn_by_topic = (
    ticket_df
    .groupby('predicted_topic')['churned_30d']
    .agg(['mean', 'count', 'sum'])
    .rename(columns={'mean': 'churn_rate', 'count': 'n_tickets', 'sum': 'n_churned'})
    .sort_values('churn_rate', ascending=False)
)

print("Churn Rate by Ticket Topic:")
print(churn_by_topic.to_string())

Churn Rate by Ticket Topic:
                   churn_rate  n_tickets  n_churned
predicted_topic
cancellation            0.700       1477       1034
content                 0.400       1981        792
streaming_quality       0.300       2536        761
billing                 0.250       2207        552
account_access          0.150       1799        270

# Visualize churn rates by topic
fig, ax = plt.subplots(figsize=(10, 6))

topics = churn_by_topic.index.tolist()
rates = churn_by_topic['churn_rate'].values
counts = churn_by_topic['n_tickets'].values

colors = ['#d32f2f' if r > 0.35 else '#ff9800' if r > 0.20 else '#388e3c'
          for r in rates]

bars = ax.bar(topics, rates, color=colors, alpha=0.85, edgecolor='black', linewidth=0.5)

# Add count labels on bars
for bar, count, rate in zip(bars, counts, rates):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.01,
            f'{rate:.0%}\n(n={count:,})',
            ha='center', va='bottom', fontsize=10, fontweight='bold')

ax.set_xlabel('Ticket Topic')
ax.set_ylabel('30-Day Churn Rate')
ax.set_title('StreamFlow: Churn Rate by Support Ticket Topic')
ax.set_ylim(0, 0.85)
ax.axhline(y=ticket_df['churned_30d'].mean(), color='gray', linestyle='--',
           label=f'Overall churn: {ticket_df["churned_30d"].mean():.0%}')
ax.legend()

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('streamflow_churn_by_topic.png', dpi=150, bbox_inches='tight')
plt.show()

Step 4: Adding Text Features to the Churn Model

The final integration: extract topic-level features from support tickets and add them to the churn model from Chapter 17.

# Aggregate ticket features at the subscriber level
# For each subscriber, compute: dominant topic, number of tickets, avg confidence

# Get topic probabilities for each ticket
topic_probs = classifier.named_steps['clf'].predict_proba(
    classifier.named_steps['tfidf'].transform(ticket_df['ticket_text'])
)

topic_prob_df = pd.DataFrame(
    topic_probs,
    columns=[f'topic_prob_{c}' for c in classifier.classes_]
)

ticket_with_probs = pd.concat([ticket_df, topic_prob_df], axis=1)

# Aggregate by subscriber
subscriber_features = (
    ticket_with_probs
    .groupby('subscriber_id')
    .agg({
        'ticket_id': 'count',
        'churned_30d': 'max',  # Churned if any ticket within 30 days
        'months_active': 'first',
        'topic_prob_billing': 'mean',
        'topic_prob_cancellation': 'mean',
        'topic_prob_content': 'mean',
        'topic_prob_streaming_quality': 'mean',
        'topic_prob_account_access': 'mean',
    })
    .rename(columns={'ticket_id': 'n_tickets'})
    .reset_index()
)

print(f"Subscriber-level features: {subscriber_features.shape}")
print(subscriber_features.head())

# Compare churn model with and without text features
from sklearn.metrics import roc_auc_score

# Features without text
numeric_features = ['n_tickets', 'months_active']

# Features with text
text_features = numeric_features + [
    'topic_prob_billing', 'topic_prob_cancellation',
    'topic_prob_content', 'topic_prob_streaming_quality',
    'topic_prob_account_access',
]

X_no_text = subscriber_features[numeric_features]
X_with_text = subscriber_features[text_features]
y = subscriber_features['churned_30d']

# Split
X_train_no, X_test_no, y_tr, y_te = train_test_split(
    X_no_text, y, test_size=0.2, random_state=42, stratify=y
)
X_train_with, X_test_with, _, _ = train_test_split(
    X_with_text, y, test_size=0.2, random_state=42, stratify=y
)

# Model without text features
lr_no_text = LogisticRegression(max_iter=1000, random_state=42)
lr_no_text.fit(X_train_no, y_tr)
auc_no_text = roc_auc_score(y_te, lr_no_text.predict_proba(X_test_no)[:, 1])

# Model with text features
lr_with_text = LogisticRegression(max_iter=1000, random_state=42)
lr_with_text.fit(X_train_with, y_tr)
auc_with_text = roc_auc_score(y_te, lr_with_text.predict_proba(X_test_with)[:, 1])

print(f"Churn model WITHOUT text features: AUC = {auc_no_text:.4f}")
print(f"Churn model WITH text features:    AUC = {auc_with_text:.4f}")
print(f"Improvement:                        +{auc_with_text - auc_no_text:.4f}")

# Which text features matter most?
feature_importance = pd.DataFrame({
    'feature': text_features,
    'coefficient': lr_with_text.coef_[0],
}).sort_values('coefficient', ascending=False)

print("\nFeature coefficients (churn model with text):")
print(feature_importance.to_string(index=False))

The topic_prob_cancellation feature has the largest positive coefficient --- subscribers whose tickets cluster around cancellation intent are the most likely to churn. The topic_prob_content feature is the second strongest predictor. This tells the retention team: content complaints are the preventable churn driver. Cancellation tickets represent subscribers who have already decided to leave. Content complaints represent subscribers who are unhappy but have not yet decided --- the intervention window.

Business Recommendations

Elena presents the findings to the leadership team:

1. Deploy the ticket classifier for automatic routing. The TF-IDF + logistic regression classifier achieves 97% accuracy on five categories. It can route tickets automatically, reducing triage time and ensuring billing complaints go to billing specialists and streaming issues go to the technical team.

2. Prioritize content-complaint tickets for proactive retention. Content complaints have a 40% churn rate --- the highest among addressable categories. These subscribers are not yet asking to cancel; they are expressing dissatisfaction with the library. The retention team should reach out with personalized content recommendations, upcoming release previews, or limited-time offers.

3. Add topic probability features to the churn model. The ticket topic distribution adds meaningful signal beyond ticket count. A subscriber who filed 3 tickets about account access has a very different risk profile than a subscriber who filed 3 tickets about content disappointment.

4. Monitor topic distribution trends monthly. If "content" complaints spike after a competitor launches new exclusive content, that is an early warning signal for churn acceleration --- before it shows up in the churn numbers.

War Story --- The StreamFlow team initially proposed building a BERT-based ticket classifier with 98.5% accuracy. The TF-IDF + logistic regression model achieved 97% with a fraction of the training time, no GPU requirement, and interpretable coefficients. The 1.5% accuracy gap was not worth the infrastructure complexity. They shipped the TF-IDF model in production, trained it on a weekly schedule with new labeled tickets, and had it running within a sprint. The BERT approach would have taken a quarter.

Return to the chapter | Previous: Case Study 1 --- ShopSmart Review Sentiment Analysis