Case Study 2: StreamRec Text Embeddings — 1D CNN vs. Bag-of-Words
Context
StreamRec's recommendation team is building content embeddings from item descriptions. Each item on the platform (articles, videos, podcasts) has a text description — a title plus a synopsis, typically 20-100 tokens. The team needs to convert these descriptions into dense vector representations that capture semantic similarity: items with similar descriptions should have similar embeddings.
The current system uses a bag-of-words (BoW) approach: TF-IDF vectors over the vocabulary, reduced to 128 dimensions via SVD. This captures which words appear in a description but ignores word order entirely. The team hypothesizes that a 1D CNN — which captures local n-gram patterns — will produce better embeddings, particularly for distinguishing between items whose descriptions contain the same words in different arrangements.
This case study implements both approaches, trains them on a category prediction task (as a proxy for semantic quality), and evaluates the resulting embeddings for nearest-neighbor retrieval.
The Data Pipeline
StreamRec's content catalog has approximately 200,000 items across 20 categories (news, drama, comedy, documentary, technology, etc.). We simulate a 50,000-item sample of this catalog with category-specific vocabulary distributions.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from typing import Dict, List, Tuple
from collections import Counter
def build_synthetic_catalog(
n_items: int = 50000,
n_categories: int = 20,
vocab_size: int = 10000,
max_length: int = 64,
seed: int = 42,
) -> Dict[str, np.ndarray]:
"""Build a synthetic StreamRec content catalog.
Each category has a characteristic vocabulary distribution,
simulating how technology articles use different words than
comedy show descriptions. Within-category variation is high
enough that classification is nontrivial.
Args:
n_items: Number of items to generate.
n_categories: Number of content categories.
vocab_size: Size of the token vocabulary.
max_length: Maximum sequence length.
seed: Random seed.
Returns:
Dictionary with 'tokens' (n_items, max_length),
'lengths' (n_items,), and 'categories' (n_items,).
"""
rng = np.random.RandomState(seed)
# Each category has 80 signature tokens and 40 secondary tokens
category_primary = {}
category_secondary = {}
used_tokens = set()
for cat in range(n_categories):
available = list(set(range(2, vocab_size)) - used_tokens)
primary = rng.choice(available, size=80, replace=False)
used_tokens.update(primary)
secondary = rng.choice(
list(set(range(2, vocab_size)) - used_tokens), size=40, replace=False
)
category_primary[cat] = primary
category_secondary[cat] = secondary
tokens = np.zeros((n_items, max_length), dtype=np.int64)
lengths = np.zeros(n_items, dtype=np.int64)
categories = np.zeros(n_items, dtype=np.int64)
for i in range(n_items):
cat = rng.randint(0, n_categories)
length = rng.randint(15, max_length + 1)
# Composition: 45% primary, 20% secondary, 35% random
n_primary = int(0.45 * length)
n_secondary = int(0.20 * length)
n_random = length - n_primary - n_secondary
seq = np.concatenate([
rng.choice(category_primary[cat], size=n_primary),
rng.choice(category_secondary[cat], size=n_secondary),
rng.randint(2, vocab_size, size=n_random),
])
rng.shuffle(seq)
tokens[i, :length] = seq[:max_length]
lengths[i] = min(length, max_length)
categories[i] = cat
return {"tokens": tokens, "lengths": lengths, "categories": categories}
class CatalogDataset(Dataset):
"""PyTorch dataset wrapper for the StreamRec catalog."""
def __init__(self, tokens: np.ndarray, categories: np.ndarray) -> None:
self.tokens = torch.tensor(tokens, dtype=torch.long)
self.categories = torch.tensor(categories, dtype=torch.long)
def __len__(self) -> int:
return len(self.categories)
def __getitem__(self, idx: int) -> Tuple[torch.Tensor, int]:
return self.tokens[idx], self.categories[idx].item()
Approach 1: Bag-of-Words + TF-IDF + SVD
The BoW baseline ignores token order entirely. It represents each description as a TF-IDF vector over the vocabulary, then reduces dimensionality with truncated SVD.
def build_bow_embeddings(
tokens: np.ndarray,
embedding_dim: int = 128,
) -> np.ndarray:
"""Build bag-of-words embeddings via TF-IDF + SVD.
Converts token sequences to "documents" (space-separated token
strings), computes TF-IDF, then reduces to embedding_dim via SVD.
Args:
tokens: Token matrix of shape (n_items, max_length).
embedding_dim: Target embedding dimension.
Returns:
Embedding matrix of shape (n_items, embedding_dim).
"""
# Convert token sequences to "documents"
docs = []
for row in tokens:
doc = " ".join(str(t) for t in row if t > 0)
docs.append(doc)
# TF-IDF
# Explicit token_pattern: the default drops single-character tokens,
# which would silently discard token IDs below 10
vectorizer = TfidfVectorizer(max_features=5000, token_pattern=r"\S+")
tfidf_matrix = vectorizer.fit_transform(docs)
# Dimensionality reduction via SVD
svd = TruncatedSVD(n_components=embedding_dim, random_state=42)
embeddings = svd.fit_transform(tfidf_matrix)
# L2 normalize
norms = np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8
embeddings = embeddings / norms
explained_var = svd.explained_variance_ratio_.sum()
print(f"BoW SVD: {embedding_dim} components explain {explained_var:.1%} of variance")
return embeddings
Approach 2: 1D CNN Embeddings
The 1D CNN captures local n-gram patterns through convolutional filters at multiple scales.
class TextCNNEmbedder(nn.Module):
"""1D CNN for text embedding extraction.
Multiple parallel Conv1d branches with different kernel sizes
capture n-gram patterns at different scales. Global max pooling
produces a fixed-size embedding regardless of input length.
Args:
vocab_size: Size of the vocabulary.
embedding_dim: Token embedding dimension.
num_filters: Number of filters per kernel size.
kernel_sizes: List of kernel sizes.
output_dim: Final embedding dimension (after projection).
dropout: Dropout probability.
"""
def __init__(
self,
vocab_size: int = 10000,
embedding_dim: int = 128,
num_filters: int = 128,
kernel_sizes: Tuple[int, ...] = (2, 3, 4, 5),
output_dim: int = 128,
dropout: float = 0.3,
) -> None:
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
self.convs = nn.ModuleList([
nn.Sequential(
nn.Conv1d(embedding_dim, num_filters, k, padding=k // 2),
nn.BatchNorm1d(num_filters),
nn.ReLU(),
)
for k in kernel_sizes
])
self.dropout = nn.Dropout(dropout)
concat_dim = num_filters * len(kernel_sizes)
self.projection = nn.Sequential(
nn.Linear(concat_dim, output_dim),
nn.ReLU(),
nn.Linear(output_dim, output_dim),
)
def get_embedding(self, x: torch.Tensor) -> torch.Tensor:
"""Extract text embedding.
Args:
x: Token indices of shape (batch, seq_len).
Returns:
L2-normalized embedding of shape (batch, output_dim).
"""
emb = self.embedding(x).transpose(1, 2) # (batch, emb_dim, seq_len)
pooled = []
for conv in self.convs:
h = conv(emb) # (batch, num_filters, seq_len')
h = h.max(dim=2).values # (batch, num_filters)
pooled.append(h)
concat = torch.cat(pooled, dim=1) # (batch, num_filters * n_kernels)
concat = self.dropout(concat)
out = self.projection(concat)
# L2 normalize for cosine similarity
out = out / (out.norm(dim=1, keepdim=True) + 1e-8)
return out
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.get_embedding(x)
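The global max pooling step is what makes the embedding length-independent: each filter contributes exactly one value regardless of how long the input is. A minimal sketch with a standalone Conv1d branch (not the full embedder) illustrates this:

```python
import torch
import torch.nn as nn

# One conv branch: 16-dim token embeddings in, 8 filters out, kernel size 3.
conv = nn.Conv1d(in_channels=16, out_channels=8, kernel_size=3, padding=1)

for seq_len in (10, 37, 64):
    x = torch.randn(2, 16, seq_len)   # (batch, emb_dim, seq_len)
    h = conv(x)                       # (batch, 8, seq_len)
    pooled = h.max(dim=2).values      # (batch, 8): one value per filter
    print(pooled.shape)               # torch.Size([2, 8]) for every length
```

This is why TextCNNEmbedder can accept any sequence length at inference time even though it was trained on fixed-length padded batches.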
class EmbeddingClassifier(nn.Module):
"""Wrapper that adds a classification head to the CNN embedder.
Training on category prediction is a proxy task that encourages
the embeddings to capture semantic similarity.
Args:
embedder: TextCNNEmbedder to train.
num_classes: Number of categories.
"""
def __init__(self, embedder: TextCNNEmbedder, num_classes: int) -> None:
super().__init__()
self.embedder = embedder
self.classifier = nn.Linear(
embedder.projection[-1].out_features, num_classes
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
emb = self.embedder.get_embedding(x)
return self.classifier(emb)
Training and Comparison
def train_cnn_embedder(
model: EmbeddingClassifier,
train_loader: DataLoader,
val_loader: DataLoader,
epochs: int = 25,
lr: float = 1e-3,
) -> Dict[str, float]:
"""Train the CNN embedder via category classification.
Args:
model: EmbeddingClassifier wrapping a TextCNNEmbedder.
train_loader: Training data loader.
val_loader: Validation data loader.
epochs: Training epochs.
lr: Learning rate.
Returns:
Final train and validation accuracy.
"""
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
for epoch in range(epochs):
model.train()
for tokens, labels in train_loader:
tokens, labels = tokens.to(device), labels.to(device)
optimizer.zero_grad()
loss = criterion(model(tokens), labels)
loss.backward()
optimizer.step()
scheduler.step()
# Final evaluation
model.eval()
results = {}
for name, loader in [("train", train_loader), ("val", val_loader)]:
correct, total = 0, 0
with torch.no_grad():
for tokens, labels in loader:
tokens, labels = tokens.to(device), labels.to(device)
preds = model(tokens).argmax(1)
correct += (preds == labels).sum().item()
total += labels.size(0)
results[f"{name}_acc"] = correct / total
return results
def extract_cnn_embeddings(
embedder: TextCNNEmbedder,
tokens: np.ndarray,
batch_size: int = 512,
) -> np.ndarray:
"""Extract CNN embeddings for the full catalog.
Args:
embedder: Trained TextCNNEmbedder.
tokens: Token matrix of shape (n_items, max_length).
batch_size: Batch size for inference.
Returns:
Embedding matrix of shape (n_items, output_dim).
"""
device = next(embedder.parameters()).device
embedder.eval()
embeddings = []
with torch.no_grad():
for i in range(0, len(tokens), batch_size):
batch = torch.tensor(
tokens[i:i+batch_size], dtype=torch.long
).to(device)
emb = embedder.get_embedding(batch).cpu().numpy()
embeddings.append(emb)
return np.concatenate(embeddings, axis=0)
def evaluate_retrieval(
embeddings: np.ndarray,
categories: np.ndarray,
n_queries: int = 100,
k: int = 10,
seed: int = 42,
) -> Dict[str, float]:
"""Evaluate embedding quality via nearest-neighbor retrieval.
For each query item, find the k nearest neighbors by cosine
similarity and compute precision@k (fraction of neighbors
in the same category).
Args:
embeddings: Embedding matrix (n_items, dim).
categories: Category labels (n_items,).
n_queries: Number of query items to evaluate.
k: Number of neighbors to retrieve.
seed: Random seed for query selection.
Returns:
Dictionary with the mean and standard deviation of precision@k.
"""
rng = np.random.RandomState(seed)
query_indices = rng.choice(len(categories), size=n_queries, replace=False)
precisions = []
for qi in query_indices:
query_emb = embeddings[qi:qi+1]
sims = cosine_similarity(query_emb, embeddings)[0]
sims[qi] = -np.inf # Exclude the query itself
top_k = np.argsort(sims)[-k:]
n_same_category = (categories[top_k] == categories[qi]).sum()
precisions.append(n_same_category / k)
return {
"mean_precision_at_k": np.mean(precisions),
"std_precision_at_k": np.std(precisions),
}
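As a sanity check on the metric, two well-separated clusters should score a precision@k of exactly 1.0. This standalone sketch mirrors the logic of evaluate_retrieval on a toy 2D embedding:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.RandomState(0)
# Two tight clusters of 20 points each, labeled 0 and 1.
emb = np.vstack([
    [1.0, 0.0] + 0.01 * rng.randn(20, 2),
    [0.0, 1.0] + 0.01 * rng.randn(20, 2),
])
cats = np.array([0] * 20 + [1] * 20)

sims = cosine_similarity(emb)
np.fill_diagonal(sims, -np.inf)           # exclude self-matches
top_k = np.argsort(sims, axis=1)[:, -5:]  # 5 nearest neighbors per item
precision = (cats[top_k] == cats[:, None]).mean()
print(precision)  # 1.0
```

If a metric implementation cannot reach 1.0 on a case this easy, the bug is in the evaluation code, not the embeddings.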
# ---- Run the comparison ----
# Generate catalog
catalog = build_synthetic_catalog(n_items=50000, n_categories=20)
tokens = catalog["tokens"]
categories = catalog["categories"]
# Split
n_train = int(0.8 * len(categories))
train_tokens, val_tokens = tokens[:n_train], tokens[n_train:]
train_cats, val_cats = categories[:n_train], categories[n_train:]
# Approach 1: BoW embeddings
print("=== Bag-of-Words Baseline ===")
bow_embeddings = build_bow_embeddings(tokens, embedding_dim=128)
# Train a logistic regression classifier on BoW for fair comparison
clf = LogisticRegression(max_iter=500, random_state=42)
clf.fit(bow_embeddings[:n_train], train_cats)
bow_train_acc = clf.score(bow_embeddings[:n_train], train_cats)
bow_val_acc = clf.score(bow_embeddings[n_train:], val_cats)
print(f"BoW classification — Train: {bow_train_acc:.4f}, Val: {bow_val_acc:.4f}")
bow_retrieval = evaluate_retrieval(bow_embeddings, categories, k=10)
print(f"BoW retrieval P@10: {bow_retrieval['mean_precision_at_k']:.4f}")
# Approach 2: 1D CNN embeddings
print("\n=== 1D CNN Embeddings ===")
dataset = CatalogDataset(tokens, categories)
# Use the same sequential 80/20 split as the BoW baseline so the
# two approaches are evaluated on identical train/val items
train_ds = torch.utils.data.Subset(dataset, range(n_train))
val_ds = torch.utils.data.Subset(dataset, range(n_train, len(dataset)))
train_loader = DataLoader(train_ds, batch_size=256, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=256, shuffle=False)
embedder = TextCNNEmbedder(
vocab_size=10000, embedding_dim=128, num_filters=128,
kernel_sizes=[2, 3, 4, 5], output_dim=128
)
classifier = EmbeddingClassifier(embedder, num_classes=20)
cnn_results = train_cnn_embedder(classifier, train_loader, val_loader, epochs=25)
print(f"CNN classification — Train: {cnn_results['train_acc']:.4f}, Val: {cnn_results['val_acc']:.4f}")
cnn_embeddings = extract_cnn_embeddings(embedder, tokens)
cnn_retrieval = evaluate_retrieval(cnn_embeddings, categories, k=10)
print(f"CNN retrieval P@10: {cnn_retrieval['mean_precision_at_k']:.4f}")
# Summary
print("\n=== Summary ===")
print(f"{'Metric':<30} {'BoW':>10} {'1D CNN':>10}")
print("-" * 52)
print(f"{'Val Classification Accuracy':<30} {bow_val_acc:>10.4f} {cnn_results['val_acc']:>10.4f}")
print(f"{'Retrieval P@10':<30} {bow_retrieval['mean_precision_at_k']:>10.4f} {cnn_retrieval['mean_precision_at_k']:>10.4f}")
Analysis
Why the CNN Wins
The 1D CNN outperforms BoW for two reasons:
- N-gram patterns. The CNN detects token sequences (bigrams, trigrams, 4-grams, 5-grams) that are characteristic of each category. A bag-of-words model treats "not recommended" and "recommended not" identically; the CNN distinguishes them. Even in our simplified simulation, where token order within categories is random, the CNN's ability to detect co-occurrence patterns in local windows gives it an advantage.
- Learned nonlinear features. The BoW pipeline is linear: TF-IDF produces sparse vectors, SVD finds the best linear subspace, and logistic regression is a linear classifier. The CNN composes nonlinear features through ReLU activations, allowing it to model complex interactions between tokens.
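The order-invariance of BoW is easy to demonstrate: two descriptions containing the same words in different arrangements produce identical TF-IDF vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same word multiset, different order.
docs = ["not recommended for children", "recommended not for children"]
X = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(X[0], X[1]))  # True: BoW cannot tell the two apart
```

Any model built on top of these vectors, no matter how powerful, inherits this blindness to word order.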
When BoW Might Win
The BoW approach has advantages in specific scenarios:
- Very short texts (< 5 tokens): There are too few tokens for n-gram patterns to be meaningful. Averaged token embeddings or TF-IDF may suffice.
- Vocabulary dominance: When category membership is determined entirely by which words appear (not their arrangement), BoW captures the signal with no wasted capacity.
- Cold start with no training data: TF-IDF + SVD requires no labeled data — it is unsupervised. The CNN requires a proxy task (category prediction) with labels.
Integration with the Recommendation System
The CNN embeddings become an input feature in StreamRec's recommendation model. The pipeline is:
- Train the TextCNNEmbedder on category prediction (or a contrastive objective — see Chapter 13).
- Extract embeddings for all items in the catalog. Store in a feature store.
- At recommendation time, look up the pre-computed item embedding and feed it, alongside user features and interaction features, into the ranking model.
- Periodically retrain the embedder as the catalog grows and new categories emerge.
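A minimal sketch of the serve-time lookup in step 3, assuming a plain dict stands in for the feature store (a production system would use Redis, a key-value store, or a dedicated feature platform; the names here are illustrative):

```python
import numpy as np

# Hypothetical in-memory feature store: item_id -> precomputed 128-d embedding.
feature_store = {
    item_id: np.random.RandomState(item_id).rand(128).astype(np.float32)
    for item_id in range(1000)
}

def ranking_features(user_vec: np.ndarray, item_id: int) -> np.ndarray:
    """Assemble the ranking model's input: user features + item embedding."""
    item_emb = feature_store[item_id]  # O(1) lookup; no CNN forward pass at serve time
    return np.concatenate([user_vec, item_emb])

feats = ranking_features(np.zeros(32, dtype=np.float32), item_id=7)
print(feats.shape)  # (160,)
```

The key design point is that the embedder never runs in the request path: all CNN inference happens offline in step 2, and serving only pays for a lookup and a concatenation.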
The embedding dimension (128) is small enough to serve at scale — adding 128 floats per item to the ranking model's input is negligible compared to the computational cost of the ranking model itself.
Lessons for Practice
- The proxy task matters. We trained on category classification, but a contrastive objective (items clicked by the same user in the same session should have similar embeddings) would produce embeddings more aligned with the recommendation task. Chapter 13 covers contrastive learning in depth.
- BoW is a strong baseline. Never skip it. In many production systems, the simple TF-IDF + SVD pipeline is within 5% of a neural approach and costs orders of magnitude less to train and serve.
- Embedding quality should be evaluated on the downstream task. Category classification accuracy is a proxy. The real test is whether the recommendation model produces better rankings when CNN embeddings replace BoW features. Always measure the end-to-end metric.
- Lightweight models for feature extraction. The TextCNNEmbedder has roughly 1.6M parameters (the 10,000 x 128 embedding table accounts for most of them) and processes thousands of items per second on CPU. For a feature extraction subcomponent in a larger system, this efficiency matters more than squeezing the last 0.5% of accuracy from a transformer.
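The contrastive alternative mentioned in the first lesson can be sketched as an in-batch InfoNCE loss, where row i of emb_a and row i of emb_b come from a positive pair (e.g. two items from the same session) and every other item in the batch serves as a negative. This is an illustrative sketch, not the full treatment in Chapter 13:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """In-batch contrastive loss: row i of emb_a pairs with row i of emb_b."""
    emb_a = F.normalize(emb_a, dim=1)
    emb_b = F.normalize(emb_b, dim=1)
    logits = emb_a @ emb_b.T / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(len(emb_a))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```

Swapping this loss in for cross-entropy on categories would require positive-pair mining from interaction logs but no change to the TextCNNEmbedder architecture itself.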