Case Study 1: StreamRec Candidate Retrieval — Exact vs. Approximate Nearest Neighbors at Scale

Context

StreamRec's recommendation system follows the standard two-stage architecture: a retrieval stage that selects a few hundred candidate items from the full catalog, followed by a ranking stage that scores and orders those candidates using a richer model. The retrieval stage must be fast enough that the total end-to-end latency — retrieval, feature enrichment, ranking, business logic, serialization — stays within the 100ms p99 SLA.

When StreamRec launched, the catalog contained 15,000 items. The retrieval stage used brute-force inner product search between the user embedding (from a matrix factorization model trained in Chapter 1) and all item embeddings: 15,000 inner products at 64 dimensions, completing in approximately 0.05ms. No one questioned whether this was fast enough.

Eighteen months later, the catalog has grown to 200,000 items. The platform has expanded from articles to include videos, podcasts, and interactive content. The embedding dimension has been increased to 256 to accommodate richer item representations (from a two-tower neural network). The brute-force retrieval time has increased to 4.2ms at p50 and 8.7ms at p99 — still within budget, but consuming 9% of the total latency allocation. The infrastructure team projects the catalog will reach 1 million items within a year.

The retrieval team is tasked with evaluating approximate nearest neighbor methods to prepare for the catalog growth. The evaluation criteria:

  1. Recall@200: Fraction of true top-200 items that appear in the retrieved set. The downstream ranker needs high-quality candidates.
  2. Query latency (p99): Must be under 15ms (the retrieval budget within the 100ms total).
  3. Index build time: Must complete within 30 minutes (the reindexing window in the daily pipeline).
  4. Memory footprint: Must fit on the serving instances (64 GB RAM per instance).
  5. Operational complexity: How difficult is the index to tune, monitor, and debug?
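Criterion 1 is mechanical to compute once both result sets are in hand. A minimal sketch of recall@k (toy data; the function name is ours):

```python
import numpy as np

def recall_at_k(retrieved: np.ndarray, exact: np.ndarray, k: int) -> float:
    """Fraction of the exact top-k that appears in the retrieved top-k."""
    return len(set(retrieved[:k]) & set(exact[:k])) / k

# Toy example: the approximate index missed item 7 from the exact top-5.
exact = np.array([3, 9, 7, 1, 4])
retrieved = np.array([3, 9, 1, 4, 8])
print(recall_at_k(retrieved, exact, k=5))  # 0.8
```

The benchmark below computes the same quantity with k=200 against brute-force ground truth.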

The Evaluation

The team evaluates three approaches at three catalog scales: current (200K), projected (1M), and stress-test (10M).

import numpy as np
import time
from typing import Dict, List, Tuple
from dataclasses import dataclass

@dataclass
class RetrievalBenchmark:
    """Results of a retrieval method benchmark."""
    method: str
    n_items: int
    recall_at_200: float
    p50_ms: float
    p99_ms: float
    build_time_s: float
    memory_mb: float

def evaluate_brute_force(
    items: np.ndarray,
    queries: np.ndarray,
    k: int = 200
) -> RetrievalBenchmark:
    """Evaluate brute-force exact retrieval.

    Args:
        items: Item embedding matrix, shape (n_items, dim).
        queries: Query embedding matrix, shape (n_queries, dim).
        k: Number of candidates to retrieve.

    Returns:
        RetrievalBenchmark with performance metrics.
    """
    n_items, dim = items.shape
    n_queries = queries.shape[0]

    # Build: no index needed
    build_time = 0.0

    # Query
    latencies = []
    for i in range(n_queries):
        start = time.perf_counter()
        scores = queries[i] @ items.T
        top_k = np.argpartition(-scores, k)[:k]
        latencies.append((time.perf_counter() - start) * 1000)

    # Memory: just the item matrix
    memory_mb = items.nbytes / (1024 ** 2)

    return RetrievalBenchmark(
        method="Brute Force",
        n_items=n_items,
        recall_at_200=1.0,  # Exact by definition
        p50_ms=np.percentile(latencies, 50),
        p99_ms=np.percentile(latencies, 99),
        build_time_s=build_time,
        memory_mb=memory_mb,
    )

def evaluate_ivf(
    items: np.ndarray,
    queries: np.ndarray,
    n_centroids: int,
    nprobe: int,
    k: int = 200
) -> Tuple[RetrievalBenchmark, np.ndarray]:
    """Evaluate IVF (Inverted File Index) retrieval.

    Args:
        items: Item embedding matrix, shape (n_items, dim).
        queries: Query embedding matrix, shape (n_queries, dim).
        n_centroids: Number of Voronoi cells.
        nprobe: Number of cells to search at query time.
        k: Number of candidates to retrieve.

    Returns:
        Tuple of (benchmark results, ground truth indices for recall computation).
    """
    n_items, dim = items.shape
    n_queries = queries.shape[0]
    rng = np.random.RandomState(42)

    # Build: k-means clustering
    build_start = time.perf_counter()

    # Simplified k-means (production: use FAISS's optimized implementation)
    centroid_idx = rng.choice(n_items, n_centroids, replace=False)
    centroids = items[centroid_idx].copy()

    for iteration in range(10):
        # Assign items to nearest centroid
        # Process in chunks to avoid memory explosion
        chunk_size = 50_000
        assignments = np.empty(n_items, dtype=np.int32)
        for start in range(0, n_items, chunk_size):
            end = min(start + chunk_size, n_items)
            sims = items[start:end] @ centroids.T
            assignments[start:end] = np.argmax(sims, axis=1)

        # Update centroids
        for c in range(n_centroids):
            mask = assignments == c
            if np.any(mask):
                centroids[c] = items[mask].mean(axis=0)
                centroids[c] /= np.linalg.norm(centroids[c]) + 1e-10

    # Build inverted lists
    inverted_lists: Dict[int, List[int]] = {}
    for idx in range(n_items):
        c = int(assignments[idx])
        if c not in inverted_lists:
            inverted_lists[c] = []
        inverted_lists[c].append(idx)

    build_time = time.perf_counter() - build_start

    # Ground truth (for recall computation)
    ground_truth = np.zeros((n_queries, k), dtype=np.int32)
    for i in range(n_queries):
        scores = queries[i] @ items.T
        ground_truth[i] = np.argpartition(-scores, k)[:k]

    # Query
    latencies = []
    recalls = []
    for i in range(n_queries):
        start = time.perf_counter()

        # Find nearest centroids
        centroid_scores = queries[i] @ centroids.T
        probe_cells = np.argpartition(-centroid_scores, nprobe)[:nprobe]

        # Collect and score candidates
        candidates = []
        for c in probe_cells:
            if c in inverted_lists:
                candidates.extend(inverted_lists[c])

        if len(candidates) > k:
            candidate_arr = np.array(candidates)
            scores = queries[i] @ items[candidate_arr].T
            top_local = np.argpartition(-scores, k)[:k]
            retrieved = set(candidate_arr[top_local])
        else:
            # k or fewer candidates: keep them all (argpartition requires kth < n)
            retrieved = set(candidates)

        latencies.append((time.perf_counter() - start) * 1000)

        # Recall
        true_set = set(ground_truth[i])
        recalls.append(len(retrieved & true_set) / k)

    memory_mb = (items.nbytes + centroids.nbytes) / (1024 ** 2)

    return RetrievalBenchmark(
        method=f"IVF (K={n_centroids}, nprobe={nprobe})",
        n_items=n_items,
        recall_at_200=np.mean(recalls),
        p50_ms=np.percentile(latencies, 50),
        p99_ms=np.percentile(latencies, 99),
        build_time_s=build_time,
        memory_mb=memory_mb,
    ), ground_truth


# Run evaluation at 200K scale
np.random.seed(42)
n_items = 200_000
dim = 256
n_queries = 500

items = np.random.randn(n_items, dim).astype(np.float32)
items /= np.linalg.norm(items, axis=1, keepdims=True)
queries = np.random.randn(n_queries, dim).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# Evaluate brute force
bf_result = evaluate_brute_force(items, queries, k=200)

# Evaluate IVF at two operating points
n_centroids = int(np.sqrt(n_items))  # ~447
ivf_low, gt = evaluate_ivf(items, queries, n_centroids=n_centroids, nprobe=10, k=200)
ivf_high, _ = evaluate_ivf(items, queries, n_centroids=n_centroids, nprobe=40, k=200)

# Print results
print(f"StreamRec Retrieval Evaluation: {n_items:,} items, {dim}-dim embeddings")
print(f"{'':=<90}")
print(f"{'Method':<35s} {'Recall@200':>10s} {'p50 (ms)':>10s} {'p99 (ms)':>10s} "
      f"{'Build (s)':>10s} {'Mem (MB)':>10s}")
print(f"{'':->90}")

for result in [bf_result, ivf_low, ivf_high]:
    print(f"{result.method:<35s} {result.recall_at_200:>10.3f} {result.p50_ms:>10.2f} "
          f"{result.p99_ms:>10.2f} {result.build_time_s:>10.1f} {result.memory_mb:>10.1f}")
StreamRec Retrieval Evaluation: 200,000 items, 256-dim embeddings
==========================================================================================
Method                              Recall@200   p50 (ms)   p99 (ms)  Build (s)   Mem (MB)
------------------------------------------------------------------------------------------
Brute Force                              1.000       4.31       8.74        0.0      195.3
IVF (K=447, nprobe=10)                   0.689       0.58       1.23       42.3      195.7
IVF (K=447, nprobe=40)                   0.912       1.87       3.41       42.1      195.7

Interpreting the Results

Three findings emerge from the evaluation:

Finding 1: Brute force is still viable at 200K, but the trend is clear. At 4.3ms p50 and 8.7ms p99, brute-force retrieval consumes approximately 9% of the 100ms latency budget. This is tolerable today, but latency scales linearly with catalog size: at 1M items, p99 would exceed 40ms, nearly three times the 15ms retrieval budget, forcing the other stages to absorb the overrun.

Finding 2: IVF provides a tunable recall-latency tradeoff. By adjusting nprobe, the team can trade recall for latency. At nprobe=10, p50 latency drops to 0.6ms (a 7x speedup) but recall falls to 69%, too low for the recommendation quality target. At nprobe=40, recall reaches 91% at 1.9ms (a 2.3x speedup). The team's target is recall@200 $\geq$ 95%, which neither operating point meets; reaching it requires a still-higher nprobe, at a roughly proportional latency cost.
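The nprobe knob maps directly onto work per query: each query scores all K centroids plus roughly nprobe * n/K items in the probed cells. A back-of-envelope sketch of this cost model (the function name is ours), plugged with the benchmark's parameters:

```python
def ivf_query_work(n_items: int, n_centroids: int, nprobe: int) -> int:
    """Approximate inner products per query: centroid scan + probed cells."""
    avg_cell_size = n_items / n_centroids
    return int(n_centroids + nprobe * avg_cell_size)

n, K = 200_000, 447
work_10 = ivf_query_work(n, K, nprobe=10)   # ~4,900 inner products
work_40 = ivf_query_work(n, K, nprobe=40)   # ~18,300 inner products
print(work_10, work_40, round(work_40 / work_10, 2))
```

The predicted ~3.7x work ratio is close to the observed 3.2x p50 gap (1.87ms vs 0.58ms); the residual comes from fixed per-query overheads such as the centroid scan.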

Finding 3: Build time is dominated by k-means training. Per-iteration cost is proportional to $n \times K \times d$, so with the centroid count held near its current value, build time grows linearly with the catalog. The 42-second build for 200K items is well within the 30-minute reindexing window; extrapolating, 1M items takes roughly 3-4 minutes and 10M roughly 35 minutes, more if K is scaled up with the catalog. Build time is not the bottleneck today, but it will press against the reindexing window at 10M.
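The build-time projection is a one-line extrapolation: per-iteration k-means cost is proportional to n * K (times the fixed dimension and iteration count), so holding K near its current value gives linear growth from the measured 42s at 200K. A sketch (the function name is ours):

```python
def kmeans_build_estimate(base_s: float, base_n: int, base_K: int,
                          n: int, K: int) -> float:
    """Extrapolate build time from per-iteration cost proportional to n * K."""
    return base_s * (n * K) / (base_n * base_K)

for n in (1_000_000, 10_000_000):
    est = kmeans_build_estimate(42.0, 200_000, 447, n, K=447)
    print(f"{n:>10,}: ~{est / 60:.1f} min")
```

This gives roughly 3.5 minutes at 1M and 35 minutes at 10M with K fixed; letting K track $\sqrt{n}$ instead (about 1,000 centroids at 1M, 3,162 at 10M) makes the cost grow as $n^{1.5}$ and would push the 10M build to several hours.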

Scaling Projections

# Project performance at larger catalog sizes (extrapolation)
print("Projected Performance at Scale")
print(f"{'':=<80}")
print(f"{'Scale':<12s} {'Method':<25s} {'Est. p99 (ms)':>14s} {'Meets SLA?':>12s}")
print(f"{'':->80}")

for n in [200_000, 1_000_000, 10_000_000]:
    # Brute force: linear in n
    bf_p99 = 8.74 * (n / 200_000)
    sla = "YES" if bf_p99 < 15 else "NO"
    print(f"{n:>10,}  {'Brute Force':<25s} {bf_p99:>14.1f} {sla:>12s}")

    # IVF (nprobe=40): sub-linear. Candidates scanned ~ nprobe * n/K,
    # and with K ~ sqrt(n) the query cost grows as sqrt(n).
    ivf_p99 = 3.41 * np.sqrt(n / 200_000)
    sla = "YES" if ivf_p99 < 15 else "NO"
    print(f"{'':>12s}{'IVF (nprobe=40)':<25s} {ivf_p99:>14.1f} {sla:>12s}")

    # HNSW: O(log n) growth. The ~1ms p99 at 200K is an assumed baseline
    # (HNSW was not benchmarked above), typical for ef=128 at this scale.
    hnsw_p99 = 1.0 * np.log2(n) / np.log2(200_000)
    sla = "YES" if hnsw_p99 < 15 else "NO"
    print(f"{'':>12s}{'HNSW (ef=128)':<25s} {hnsw_p99:>14.1f} {sla:>12s}")

    print()
Projected Performance at Scale
================================================================================
Scale        Method                     Est. p99 (ms)   Meets SLA?
--------------------------------------------------------------------------------
   200,000  Brute Force                           8.7          YES
            IVF (nprobe=40)                       3.4          YES
            HNSW (ef=128)                         1.0          YES

 1,000,000  Brute Force                          43.7           NO
            IVF (nprobe=40)                       7.6          YES
            HNSW (ef=128)                         1.1          YES

10,000,000  Brute Force                         437.0           NO
            IVF (nprobe=40)                      24.1           NO
            HNSW (ef=128)                         1.3          YES
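The HNSW row in these projections assumes index behavior the chapter has not shown in code. Its core mechanism, greedy best-first search over a proximity graph, can be illustrated with a toy single-layer sketch (illustrative only: real HNSW adds hierarchical layers and a neighbor-pruning heuristic, and production use should go through FAISS or hnswlib):

```python
import heapq
import numpy as np

def build_knn_graph(items: np.ndarray, M: int = 8) -> np.ndarray:
    """Toy proximity graph: link each item to its M exact nearest neighbors.
    (Real HNSW builds this incrementally with a pruning heuristic.)"""
    sims = items @ items.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-links
    return np.argpartition(-sims, M, axis=1)[:, :M]

def graph_search(query, items, graph, entry=0, ef=32, k=10):
    """Greedy best-first search: expand promising nodes, keep the top-ef seen."""
    score = lambda i: float(query @ items[i])
    visited = {entry}
    candidates = [(-score(entry), entry)]  # max-heap (negated scores)
    results = [(score(entry), entry)]      # min-heap of the best ef found
    while candidates:
        neg_s, node = heapq.heappop(candidates)
        if len(results) >= ef and -neg_s < results[0][0]:
            break  # best open candidate is worse than the worst kept result
        for nb in map(int, graph[node]):
            if nb in visited:
                continue
            visited.add(nb)
            s = score(nb)
            if len(results) < ef or s > results[0][0]:
                heapq.heappush(candidates, (-s, nb))
                heapq.heappush(results, (s, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst
    return [idx for _, idx in sorted(results, reverse=True)[:k]]

rng = np.random.default_rng(0)
items = rng.standard_normal((2000, 32)).astype(np.float32)
items /= np.linalg.norm(items, axis=1, keepdims=True)
graph = build_knn_graph(items)
query = items[123]  # query near a known item
print(graph_search(query, items, graph, k=10))
```

The ef parameter plays the same recall-latency role as IVF's nprobe: a wider result beam delays the termination condition and visits more of the graph.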

The Team's Decision

The projections make the decision clear:

  1. Immediate term (200K-500K items): Migrate from brute force to IVF using FAISS. The build time is fast, the recall is tunable, and the latency improvement provides headroom for catalog growth. IVF also composes with product quantization (IVF-PQ) for memory compression, which becomes important as the catalog grows.

  2. Medium term (500K-5M items): Switch to HNSW. Its $O(\log n)$ query scaling outperforms IVF's $O(\sqrt{n})$ at scale. The tradeoff is higher memory (HNSW stores the graph structure) and longer build times, but both are within the operational budget.

  3. Long term (>5M items): Evaluate composite indices (IVF-HNSW-PQ in FAISS, ScaNN's learned quantization) and sharded serving (distributing the index across multiple machines). At this scale, the memory constraint becomes binding before the latency constraint.
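The sharded-serving option in point 3 is a scatter-gather pattern: each shard answers the query against its slice of the catalog, and a coordinator merges the per-shard top-k. A minimal single-process sketch (shard layout and names are ours):

```python
import numpy as np

def shard_catalog(items: np.ndarray, n_shards: int):
    """Partition the catalog into contiguous shards, keeping global offsets."""
    bounds = np.linspace(0, len(items), n_shards + 1, dtype=int)
    return [(items[a:b], int(a)) for a, b in zip(bounds[:-1], bounds[1:])]

def sharded_search(query: np.ndarray, shards, k: int) -> np.ndarray:
    """Scatter the query to every shard, then gather and merge local top-k."""
    ids, scores = [], []
    for shard_items, offset in shards:
        local = query @ shard_items.T
        kk = min(k, len(local))
        top = np.argpartition(-local, kk - 1)[:kk]
        ids.append(top + offset)      # map back to global item ids
        scores.append(local[top])
    ids, scores = np.concatenate(ids), np.concatenate(scores)
    return ids[np.argsort(-scores)[:k]]  # coordinator-side merge

rng = np.random.default_rng(1)
items = rng.standard_normal((10_000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)
merged = sharded_search(query, shard_catalog(items, 4), k=200)
exact = np.argsort(-(query @ items.T))[:200]
print(np.array_equal(np.sort(merged), np.sort(exact)))  # True: merge is exact
```

Because each shard returns its exact local top-k, the merged result equals the global top-k; with an ANN index inside each shard, the merge is only as good as the per-shard recall.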

Lessons for Practice

Production Reality: The team spent two weeks on this evaluation — not because the algorithms are complex, but because production evaluation requires:

  • Representative queries: Using actual user embeddings from production, not random vectors. The distribution of queries affects recall significantly.
  • Tail latency measurement: p99 (not median) is the SLA metric. ANN methods can have high variance due to graph traversal paths (HNSW) or variable cell sizes (IVF).
  • Recall measurement against the ranking model: Retrieval recall matters only insofar as it affects final recommendation quality. A 5% recall drop in retrieval may cause only a 0.3% drop in ranking NDCG if the missed items are low-quality.
  • Index update strategy: HNSW supports incremental insertion but not deletion. IVF requires full retraining when the data distribution shifts. The operational model must match the index type.
  • Monitoring: In production, retrieval recall is not directly observable (you don't know the true top-200 at serving time). The team monitors proxy metrics: distribution of candidate scores, diversity of retrieved candidates, and offline periodic recall evaluation against brute-force snapshots.
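The offline recall check in the last bullet can be as simple as replaying sampled queries against a brute-force snapshot and comparing with the candidates the production index served. A sketch of that proxy computation (names and logged data are illustrative; in production the served sets would come from request logs):

```python
import numpy as np

def snapshot_recall(served: np.ndarray, queries: np.ndarray,
                    snapshot_items: np.ndarray, k: int = 200) -> float:
    """Mean recall@k of served candidate sets vs. a brute-force snapshot.

    served: (n_queries, k) item ids the production index returned.
    """
    recalls = []
    for q, cand in zip(queries, served):
        scores = q @ snapshot_items.T
        exact = np.argpartition(-scores, k)[:k]
        recalls.append(len(set(cand) & set(exact)) / k)
    return float(np.mean(recalls))

# Sanity check: exact retrieval scores recall 1.0 against its own snapshot.
rng = np.random.default_rng(2)
items = rng.standard_normal((5_000, 32)).astype(np.float32)
queries = rng.standard_normal((20, 32)).astype(np.float32)
served = np.stack([np.argpartition(-(q @ items.T), 200)[:200] for q in queries])
print(snapshot_recall(served, queries, items))  # 1.0
```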

The core lesson: algorithmic complexity analysis tells you when to start looking for approximate methods, but production evaluation tells you which approximate method works for your system. The analysis in this chapter provides the first answer; the engineering in Chapter 24 provides the second.