Case Study 1 — StreamRec Click Prediction
From Handcrafted Features to MLP: Building the Neural Baseline
Context
StreamRec, a content streaming platform with approximately 5 million users and 200,000 items, needs to predict whether a user will click on a recommended item. In Chapter 1, we built a matrix factorization baseline that uses latent user and item factors. In Chapter 4, we used mutual information to rank features by their nonlinear dependence on the target.
Now we take the next step: training a multi-layer perceptron on handcrafted user-item features. This MLP serves as the neural baseline — the simplest neural model — against which all subsequent architectures (CNNs in Chapter 8, transformers in Chapter 10, two-tower models in Chapter 13) will be compared.
The question is concrete: does the MLP capture nonlinear feature interactions that logistic regression misses, and by how much?
Feature Engineering
StreamRec's data science team has constructed 20 features for each user-item pair, grouped into three categories:
```python
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FeatureSpec:
    """Specification for a StreamRec feature."""
    name: str
    category: str  # "user", "item", or "interaction"
    description: str


STREAMREC_FEATURES: List[FeatureSpec] = [
    # User features (indices 0-7)
    FeatureSpec("user_activity_level", "user",
                "Log of total interactions in past 30 days"),
    FeatureSpec("user_session_count", "user",
                "Number of sessions in past 7 days"),
    FeatureSpec("user_avg_session_length", "user",
                "Mean session duration in minutes"),
    FeatureSpec("user_content_diversity", "user",
                "Entropy of category distribution in history"),
    FeatureSpec("user_recency", "user",
                "Days since last visit (log-transformed)"),
    FeatureSpec("user_subscription_tier", "user",
                "0=free, 1=basic, 2=premium"),
    FeatureSpec("user_platform", "user",
                "One-hot: mobile(0)/desktop(1)/tablet(2)"),
    FeatureSpec("user_tenure_days", "user",
                "Days since account creation (log-transformed)"),
    # Item features (indices 8-14)
    FeatureSpec("item_popularity", "item",
                "Log of total interactions in past 30 days"),
    FeatureSpec("item_freshness", "item",
                "Days since publication (log-transformed)"),
    FeatureSpec("item_creator_reputation", "item",
                "Creator's average item engagement rate"),
    FeatureSpec("item_length_minutes", "item",
                "Content duration (log-transformed)"),
    FeatureSpec("item_engagement_rate", "item",
                "Historical click-through rate"),
    FeatureSpec("item_completion_rate", "item",
                "Historical completion rate"),
    FeatureSpec("item_trending_score", "item",
                "Velocity of recent interactions"),
    # Interaction features (indices 15-19)
    FeatureSpec("category_match", "interaction",
                "Cosine similarity: user pref vs item category"),
    FeatureSpec("time_of_day_sin", "interaction",
                "sin(2*pi*hour/24) — captures circadian pattern"),
    FeatureSpec("time_of_day_cos", "interaction",
                "cos(2*pi*hour/24) — captures circadian pattern"),
    FeatureSpec("historical_ctr_user_category", "interaction",
                "User's historical CTR in this item's category"),
    FeatureSpec("collaborative_score", "interaction",
                "Dot product of user/item latent factors from Ch.1"),
]
```
The interaction features are particularly important. The category_match and historical_ctr_user_category features encode the user-item affinity that matrix factorization captured implicitly. The collaborative_score feature directly incorporates the matrix factorization output from Chapter 1 as an input feature — a common production pattern where simple models feed into complex ones.
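This feeding pattern can be sketched concretely. The helper below is a minimal illustration, not the Chapter 1 implementation: `collaborative_score`, `U`, and `V` are hypothetical stand-ins for the trained matrix factorization factors.

```python
import numpy as np

def collaborative_score(user_factors: np.ndarray,
                        item_factors: np.ndarray,
                        user_id: int,
                        item_id: int) -> float:
    """Dot product of latent user and item factors (as in the Ch.1 MF model)."""
    return float(user_factors[user_id] @ item_factors[item_id])

# Toy factors: 3 users, 4 items, 2 latent dimensions
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 2))  # user latent factors
V = rng.normal(size=(4, 2))  # item latent factors
print(collaborative_score(U, V, user_id=1, item_id=2))
```

In production this scalar is computed once per candidate pair and appended to the feature vector at index 19, so the MLP can refine rather than rediscover the collaborative signal.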
Logistic Regression Baseline
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, log_loss, average_precision_score


def train_logistic_baseline(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_val: np.ndarray,
    y_val: np.ndarray,
    X_test: np.ndarray,
    y_test: np.ndarray,
) -> dict:
    """Train logistic regression with standard scaling.

    Args:
        X_train, y_train: Training data.
        X_val, y_val: Validation data.
        X_test, y_test: Test data.

    Returns:
        Dictionary with metrics for all splits.
    """
    # Fit the scaler on training data only to avoid leakage
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_val_s = scaler.transform(X_val)
    X_test_s = scaler.transform(X_test)

    model = LogisticRegression(
        C=1.0, max_iter=1000, random_state=42, solver="lbfgs"
    )
    model.fit(X_train_s, y_train)

    results = {}
    for name, X, y in [
        ("train", X_train_s, y_train),
        ("val", X_val_s, y_val),
        ("test", X_test_s, y_test),
    ]:
        y_prob = model.predict_proba(X)[:, 1]
        results[f"{name}_auc"] = roc_auc_score(y, y_prob)
        results[f"{name}_logloss"] = log_loss(y, y_prob)
        results[f"{name}_ap"] = average_precision_score(y, y_prob)
    return results
```
MLP Implementation
```python
def train_click_mlp(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_val: np.ndarray,
    y_val: np.ndarray,
    X_test: np.ndarray,
    y_test: np.ndarray,
    hidden_dims: List[int] = [128, 64, 32],
    epochs: int = 200,
    batch_size: int = 128,
    lr: float = 0.005,
    seed: int = 42,
) -> Tuple[dict, dict]:
    """Train an MLP for click prediction and evaluate.

    Args:
        X_train, y_train: Training data.
        X_val, y_val: Validation data.
        X_test, y_test: Test data.
        hidden_dims: Hidden layer dimensions.
        epochs: Number of training epochs.
        batch_size: Mini-batch size.
        lr: Learning rate.
        seed: Random seed.

    Returns:
        Tuple of (metrics_dict, training_history).
    """
    # Standardize features using training statistics only
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8
    X_train_s = (X_train - mean) / std
    X_val_s = (X_val - mean) / std
    X_test_s = (X_test - mean) / std

    input_dim = X_train.shape[1]
    layer_dims = [input_dim] + hidden_dims + [1]
    config = MLPConfig(
        layer_dims=layer_dims,
        activation="relu",
        output_activation="sigmoid",
        seed=seed,
    )
    model = NumpyMLP(config)
    history = model.train(
        X_train=X_train_s,
        y_train=y_train,
        X_val=X_val_s,
        y_val=y_val,
        epochs=epochs,
        batch_size=batch_size,
        lr=lr,
        verbose=False,
    )

    # Evaluate on all splits
    results = {}
    for name, X, y in [
        ("train", X_train_s, y_train),
        ("val", X_val_s, y_val),
        ("test", X_test_s, y_test),
    ]:
        y_prob = model.forward(X).ravel()
        results[f"{name}_auc"] = roc_auc_score(y, y_prob)
        results[f"{name}_logloss"] = binary_cross_entropy(y_prob, y)
        results[f"{name}_ap"] = average_precision_score(y, y_prob)
    return results, history
```
Results and Analysis
We generate a dataset with the nonlinear interactions described in Section 6.8 and compare the two models:
```python
data = generate_click_data(n_samples=20_000, n_features=20, seed=42)

lr_results = train_logistic_baseline(
    data["X_train"], data["y_train"],
    data["X_val"], data["y_val"],
    data["X_test"], data["y_test"],
)
mlp_results, mlp_history = train_click_mlp(
    data["X_train"], data["y_train"],
    data["X_val"], data["y_val"],
    data["X_test"], data["y_test"],
)

print("=" * 60)
print(f"{'Metric':<20} {'LogReg':>12} {'MLP':>12} {'Diff':>12}")
print("=" * 60)
for metric in ["test_auc", "test_logloss", "test_ap"]:
    lr_val = lr_results[metric]
    mlp_val = mlp_results[metric]
    diff = mlp_val - lr_val
    sign = "+" if diff > 0 else ""
    print(f"{metric:<20} {lr_val:>12.4f} {mlp_val:>12.4f} {sign}{diff:>11.4f}")
```
Expected results (approximate):
| Metric | Logistic Regression | MLP [128, 64, 32] | Improvement |
|---|---|---|---|
| Test AUC | 0.700 | 0.755 | +0.055 |
| Test LogLoss | 0.635 | 0.590 | -0.045 |
| Test AP | 0.685 | 0.740 | +0.055 |
The MLP outperforms logistic regression by roughly 5 AUC points. This gap is entirely explained by the nonlinear feature interactions in the data-generating process: the $x_0 \cdot x_1$ interaction, the $x_2^2$ quadratic term, and the $\sin(x_3 \pi)$ periodic effect cannot be captured by a linear model operating on raw features.
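To confirm that the gap really comes from these terms, a small standalone experiment helps: we build synthetic data from the same three nonlinearities (this is an illustrative toy generator, not StreamRec's `generate_click_data`) and show that logistic regression closes most of the gap once the interactions are handed to it as explicit features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 10_000
X = rng.normal(size=(n, 4))

# Logits built from the three nonlinear terms described above
logits = 1.5 * X[:, 0] * X[:, 1] + X[:, 2] ** 2 + np.sin(np.pi * X[:, 3]) - 1.0
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Augment the raw features with the interaction terms made explicit
X_aug = np.column_stack(
    [X, X[:, 0] * X[:, 1], X[:, 2] ** 2, np.sin(np.pi * X[:, 3])]
)

tr, te = slice(0, 8_000), slice(8_000, n)
auc_raw = roc_auc_score(
    y[te],
    LogisticRegression(max_iter=1000).fit(X[tr], y[tr]).predict_proba(X[te])[:, 1],
)
auc_aug = roc_auc_score(
    y[te],
    LogisticRegression(max_iter=1000).fit(X_aug[tr], y[tr]).predict_proba(X_aug[te])[:, 1],
)
print(f"raw AUC: {auc_raw:.3f}  augmented AUC: {auc_aug:.3f}")
```

The linear model on raw features sits close to chance because $x_0 x_1$ and $x_2^2$ carry no linear signal; with the terms added, it recovers most of what the MLP learns automatically.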
Architecture Search
How sensitive are the results to the MLP architecture?
```python
# Hidden-layer configurations to compare (the output layer is added internally)
architectures = [
    [32],                # Shallow: 1 hidden layer
    [64, 32],            # Medium: 2 hidden layers
    [128, 64, 32],       # Deep: 3 hidden layers (default)
    [256, 128, 64, 32],  # Deeper: 4 hidden layers
]

for hidden in architectures:
    dims = [20] + hidden + [1]
    results, _ = train_click_mlp(
        data["X_train"], data["y_train"],
        data["X_val"], data["y_val"],
        data["X_test"], data["y_test"],
        hidden_dims=hidden,
    )
    print(f"Architecture {str(dims):<30} "
          f"Val AUC: {results['val_auc']:.4f} "
          f"Test AUC: {results['test_auc']:.4f}")
```
Expected pattern: the single hidden layer already captures most of the nonlinear interactions. Two hidden layers provide marginal improvement. Three or four layers offer diminishing returns and may show slight overfitting (training AUC increases but validation AUC plateaus or decreases). For tabular data with 20 features and modest nonlinearity, a 2-3 layer MLP is typically sufficient.
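Depth also changes model size. A quick parameter count for a fully connected network (the helper `mlp_param_count` is an illustrative utility, not part of the project code) shows how fast capacity grows with each added layer, which helps explain the overfitting risk:

```python
from typing import List

def mlp_param_count(layer_dims: List[int]) -> int:
    """Total weights + biases for a fully connected net with these layer sizes."""
    return sum(d_in * d_out + d_out
               for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:]))

for hidden in [[32], [64, 32], [128, 64, 32], [256, 128, 64, 32]]:
    dims = [20] + hidden + [1]
    print(f"{str(dims):<28} {mlp_param_count(dims):>7,} parameters")
```

The deepest configuration has roughly 60x the parameters of the shallow one while fitting the same 20-feature input, so with only 20,000 samples the extra capacity mostly buys variance.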
Learning Rate Sensitivity
The learning rate is the most important hyperparameter. Too large, and the loss oscillates or diverges. Too small, and convergence is prohibitively slow.
```python
learning_rates = [0.1, 0.05, 0.01, 0.005, 0.001, 0.0001]

for lr_val in learning_rates:
    results, history = train_click_mlp(
        data["X_train"], data["y_train"],
        data["X_val"], data["y_val"],
        data["X_test"], data["y_test"],
        hidden_dims=[128, 64, 32],
        epochs=200,
        lr=lr_val,
    )
    final_val_loss = history["val_loss"][-1]
    print(f"LR = {lr_val:<8} Val AUC: {results['val_auc']:.4f} "
          f"Final Val Loss: {final_val_loss:.4f}")
```
Expected pattern: learning rates above 0.05 cause training instability (oscillating or diverging loss). Learning rates around 0.005-0.01 converge to the best validation performance. Learning rates below 0.001 converge too slowly — the model underfits because it has not had enough effective updates in 200 epochs. This sensitivity is inherent to SGD with constant learning rates; Chapter 7 introduces learning rate schedules (warmup, cosine annealing) that reduce this sensitivity.
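As a preview of those schedules, one common form (linear warmup followed by cosine annealing) can be sketched as follows; the helper and its defaults are illustrative, not the Chapter 7 implementation:

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 0.01, lr_min: float = 1e-4,
              warmup_steps: int = 0) -> float:
    """Linear warmup to lr_max, then cosine annealing down to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Learning rate at the start, end of warmup, midpoint, and final step
print([round(cosine_lr(s, 200, warmup_steps=10), 5) for s in (0, 10, 100, 199)])
```

The schedule starts small (stabilizing the early updates that a large constant rate would destabilize) and ends small (allowing fine convergence that a large constant rate would prevent), which is why it softens the sensitivity seen in the sweep above.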
Error Analysis
Examining where the MLP fails reveals the decision boundary it has learned:
```python
# Identify misclassified examples. This assumes the trained model and the
# standardized test features are in scope — e.g. train_click_mlp modified to
# also return `model` (as `mlp`) and `X_test_s`.
test_pred = mlp.forward(X_test_s).ravel()
test_labels = data["y_test"]
predictions = (test_pred > 0.5).astype(int)

false_positives = np.where((predictions == 1) & (test_labels == 0))[0]
false_negatives = np.where((predictions == 0) & (test_labels == 1))[0]
print(f"False positives: {len(false_positives)} "
      f"({100 * len(false_positives) / len(test_labels):.1f}%)")
print(f"False negatives: {len(false_negatives)} "
      f"({100 * len(false_negatives) / len(test_labels):.1f}%)")

# Compare prediction confidence for correct vs. incorrect predictions
correct = np.where(predictions == test_labels)[0]
incorrect = np.where(predictions != test_labels)[0]
print(f"\nMean predicted probability — correct: {test_pred[correct].mean():.4f}")
print(f"Mean predicted probability — incorrect: {test_pred[incorrect].mean():.4f}")
```
The misclassified examples tend to cluster near the decision boundary ($\hat{y} \approx 0.5$), where the model is least confident. This is expected: the data-generating process includes irreducible noise (the Bernoulli sampling), so even a perfect model would misclassify examples whose true click probability is near 0.5 almost half the time.
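This pattern holds even for a perfectly calibrated model. A quick simulation (purely illustrative, unrelated to the StreamRec data) draws labels from known probabilities and shows that thresholded errors concentrate in the bins nearest 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(size=50_000)           # predicted probabilities, perfectly calibrated
y = rng.uniform(size=p.size) < p       # labels drawn from exactly those probabilities
errors = (p > 0.5) != y                # thresholded prediction vs. actual outcome

for lo in np.arange(0.0, 1.0, 0.25):
    mask = (p >= lo) & (p < lo + 0.25)
    print(f"p in [{lo:.2f}, {lo + 0.25:.2f}):  error rate = {errors[mask].mean():.3f}")
```

No amount of extra model capacity removes these errors; they are the Bayes error of the problem, which is why error analysis should focus on confident mistakes rather than boundary cases.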
Key Takeaways from This Case Study
- The MLP captures feature interactions that logistic regression cannot. The 5-point AUC improvement comes entirely from the nonlinear decision boundary. For data with truly linear decision boundaries, the MLP would offer no advantage.
- Feature engineering still matters. The interaction features (category_match, collaborative_score) encode domain knowledge that even the MLP benefits from. Raw features without these interactions would require the network to discover them, which requires more data and capacity.
- Architecture depth has diminishing returns on tabular data. Unlike image or text data, where depth captures hierarchical structure (edges to textures to objects), tabular data often has shallow nonlinear structure that a 2-3 layer network can capture.
- This MLP is the baseline, not the final model. Chapter 7 will add batch normalization, dropout, and learning rate scheduling. Chapter 10 will replace the handcrafted features with learned representations. Chapter 13 will build a two-tower architecture that separates user and item encoding. Each improvement should be compared against this baseline.
Connection to the Progressive Project
This case study implements Milestone M2 of the StreamRec progressive project. The key artifact is the click-prediction MLP with a clear evaluation protocol (AUC, log loss, average precision) and a comparison against the logistic regression baseline. In Chapter 7, the same architecture will be trained with proper regularization. In Chapter 8, the item features will be replaced with 1D CNN embeddings extracted from content descriptions.