Case Study 1: Feature Selection via Mutual Information at StreamRec
Context
StreamRec's recommendation team is building a new content ranking model. The first step is feature selection: from the 47 available user features, which ones carry the most predictive information about content engagement?
The team's junior data scientist, trained on the standard scikit-learn pipeline, proposes ranking features by their absolute Pearson correlation with the target variable (binary engagement: clicked/not-clicked). The senior data scientist pushes back: "Correlation misses nonlinear relationships. We should use mutual information."
This case study walks through both approaches, identifies where they diverge, implements MI from scratch and with scikit-learn, and draws conclusions about when each method is appropriate.
The Data
StreamRec has 2 million user-content interaction records. Each record has user features, content features, and a binary engagement label. For simplicity, we work with a 200,000-record sample and focus on a subset of 8 representative user-side features spanning continuous, ordinal, and categorical types.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from scipy.stats import chi2_contingency, spearmanr
from typing import Tuple
def generate_streamrec_features(n: int = 200000, seed: int = 42) -> Tuple[pd.DataFrame, np.ndarray]:
"""Generate realistic StreamRec user features with known nonlinear effects.
The data-generating process is designed to include:
- Linear effects (session count)
- Threshold effects (subscription tier)
- U-shaped effects (age)
- Interaction effects (genre x device)
- Pure noise features
Returns:
Tuple of (feature DataFrame, binary engagement labels).
"""
rng = np.random.RandomState(seed)
df = pd.DataFrame({
# Continuous features
"age": rng.randint(18, 75, n),
"days_since_signup": rng.exponential(400, n).astype(int),
"sessions_last_7d": rng.poisson(6, n),
"avg_watch_pct": np.clip(rng.beta(2, 3, n) * 100, 0, 100),
# Categorical features
"subscription_tier": rng.choice(
["free", "basic", "premium"], n, p=[0.50, 0.35, 0.15]
),
"primary_genre": rng.choice(
["drama", "comedy", "documentary", "action", "sci-fi"],
n, p=[0.30, 0.25, 0.20, 0.15, 0.10]
),
"primary_device": rng.choice(
["mobile", "desktop", "smart_tv", "tablet"],
n, p=[0.50, 0.25, 0.15, 0.10]
),
# Noise feature (should rank last)
"random_id_hash": rng.randint(0, 1000, n),
})
# Encode for the generative model
tier_num = df["subscription_tier"].map({"free": 0, "basic": 1, "premium": 2}).values
genre_num = df["primary_genre"].map({
"drama": 0, "comedy": 1, "documentary": 2, "action": 3, "sci-fi": 4
}).values
# Engagement logit with intentional nonlinear effects
logit = (
# Linear: more sessions -> more engagement
0.08 * df["sessions_last_7d"].values
# Threshold: premium gets a jump, basic gets a smaller jump
+ 0.9 * (tier_num == 2) + 0.3 * (tier_num == 1)
# U-shaped: young (<25) and older (>55) users more engaged
        + 0.5 * np.cos((df["age"].values - 18) * 2 * np.pi / 57)
# Saturating: signup age matters, but plateaus after 6 months
+ 0.3 * np.tanh(df["days_since_signup"].values / 200)
# Interaction: documentary watchers on desktop more engaged
+ 0.7 * ((genre_num == 2) &
(df["primary_device"] == "desktop")).astype(float)
# Watch percentage: strong predictor but with diminishing returns
+ 0.02 * np.sqrt(df["avg_watch_pct"].values)
# Noise: random_id_hash has NO true effect
+ 0.0 * df["random_id_hash"].values
# Base rate and noise
- 0.8 + rng.normal(0, 1.0, n)
)
engaged = (logit > 0).astype(int)
return df, engaged
# Generate data
features, target = generate_streamrec_features(n=200000)
print(f"Samples: {len(features)}, Engagement rate: {target.mean():.3f}")
print(f"Features: {list(features.columns)}")
Samples: 200000, Engagement rate: 0.506
Features: ['age', 'days_since_signup', 'sessions_last_7d', 'avg_watch_pct', 'subscription_tier', 'primary_genre', 'primary_device', 'random_id_hash']
Approach 1: Correlation-Based Ranking
For continuous features, we use the absolute Pearson correlation (with a binary target this is the point-biserial correlation). For categorical features, we use Cramér's V, which generalizes the phi coefficient to variables with more than two categories.
def correlation_ranking(features: pd.DataFrame,
target: np.ndarray) -> pd.DataFrame:
"""Rank features by linear association with the target.
Uses absolute Pearson correlation for continuous features
and Cramér's V for categorical features.
"""
results = []
for col in features.columns:
if features[col].dtype == "object":
# Cramér's V for categorical features
contingency = pd.crosstab(features[col], target)
chi2, _, _, _ = chi2_contingency(contingency)
n = len(target)
min_dim = min(contingency.shape) - 1
v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else 0
results.append({"feature": col, "association": v, "method": "Cramér's V"})
else:
# Absolute Pearson correlation for continuous features
corr = np.abs(np.corrcoef(features[col].values.astype(float), target)[0, 1])
results.append({"feature": col, "association": corr, "method": "|Pearson r|"})
df = pd.DataFrame(results)
df["corr_rank"] = df["association"].rank(ascending=False).astype(int)
return df.sort_values("corr_rank")
corr_results = correlation_ranking(features, target)
print("Correlation-Based Feature Ranking")
print("=" * 55)
print(corr_results.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
Correlation-Based Feature Ranking
=======================================================
feature association method corr_rank
subscription_tier 0.1578 Cramér's V 1
sessions_last_7d 0.0951 |Pearson r| 2
days_since_signup 0.0629 |Pearson r| 3
avg_watch_pct 0.0375 |Pearson r| 4
primary_device 0.0332 Cramér's V 5
primary_genre 0.0289 Cramér's V 6
age 0.0058 |Pearson r| 7
random_id_hash 0.0014 |Pearson r| 8
Age ranks 7th out of 8 — barely above the pure noise feature. Correlation sees almost no linear relationship between age and engagement.
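This failure mode is easy to reproduce in isolation. The toy snippet below (standalone, not part of the StreamRec pipeline) builds a target that is perfectly determined by a symmetric function of x; Pearson correlation still comes out near zero because the positive and negative slopes cancel:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1.0, 1.0, 100_000)
# y depends only on |x|: "engaged" at both extremes, analogous to the age effect
y = (np.abs(x) > 0.5).astype(float)

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r: {r:.4f}")  # near zero despite y being a deterministic function of |x|
```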
Approach 2: Mutual Information-Based Ranking
MI captures both linear and nonlinear dependencies. We use scikit-learn's mutual_info_classif, which employs a Kraskov-Stögbauer-Grassberger (KSG) style nearest-neighbor estimator (Ross's 2014 variant for a continuous feature paired with a discrete target) and a direct plug-in estimate when both feature and target are discrete.
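For intuition about what happens under the hood for a continuous feature, here is a rough sketch of the Ross (2014) nearest-neighbor estimator. It is simplified (no tie-breaking noise, and it assumes every class has at least two members), so treat it as illustrative rather than a drop-in replacement for mutual_info_classif:

```python
import numpy as np
from scipy.special import digamma
from sklearn.neighbors import KDTree, NearestNeighbors

def mi_continuous_discrete(x: np.ndarray, y: np.ndarray, k: int = 3) -> float:
    """Sketch of the Ross (2014) kNN estimator for I(continuous X; discrete Y)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    n = len(x)
    radius = np.empty(n)
    label_counts = np.empty(n)
    k_all = np.empty(n)
    for label in np.unique(y):
        mask = y == label
        count = int(mask.sum())
        label_counts[mask] = count
        kk = min(k, count - 1)  # simplification: assumes count >= 2 per class
        k_all[mask] = kk
        # Distance to the kk-th nearest neighbor *within the same class*
        nn = NearestNeighbors(n_neighbors=kk).fit(x[mask])
        dist, _ = nn.kneighbors()  # excludes each query point itself
        radius[mask] = np.nextafter(dist[:, -1], 0)
    # m_i: points within that radius over the *whole* sample (self included here)
    m_all = KDTree(x).query_radius(x, radius, count_only=True)
    mi = (digamma(n) + np.mean(digamma(k_all))
          - np.mean(digamma(label_counts)) - np.mean(digamma(m_all)))
    return max(0.0, mi)
```

The digamma terms come from the same local density argument as the original KSG estimator: a tight same-class neighborhood that also contains few points overall indicates that the feature says little about the class.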
def mi_ranking(features: pd.DataFrame,
target: np.ndarray) -> pd.DataFrame:
"""Rank features by mutual information with the target."""
# Encode categoricals as integer codes
features_encoded = features.copy()
categorical_cols = features.select_dtypes(include=["object"]).columns
for col in categorical_cols:
features_encoded[col] = features_encoded[col].astype("category").cat.codes
# Identify discrete features
discrete_mask = [col in categorical_cols for col in features_encoded.columns]
# Compute MI
mi_scores = mutual_info_classif(
features_encoded.values, target,
discrete_features=discrete_mask,
random_state=42,
n_neighbors=5
)
df = pd.DataFrame({
"feature": features.columns,
"MI_score": mi_scores,
})
df["MI_rank"] = df["MI_score"].rank(ascending=False).astype(int)
return df.sort_values("MI_rank")
mi_results = mi_ranking(features, target)
print("Mutual Information-Based Feature Ranking")
print("=" * 45)
print(mi_results.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
Mutual Information-Based Feature Ranking
=============================================
feature MI_score MI_rank
subscription_tier 0.0307 1
sessions_last_7d 0.0112 2
age 0.0091 3
days_since_signup 0.0072 4
avg_watch_pct 0.0054 5
primary_device 0.0042 6
primary_genre 0.0031 7
random_id_hash 0.0001 8
Age jumps from rank 7 (correlation) to rank 3 (MI). This is the key disagreement.
Where the Rankings Disagree — and Why
# Merge and compare
comparison = corr_results[["feature", "association", "corr_rank"]].merge(
mi_results[["feature", "MI_score", "MI_rank"]], on="feature"
)
comparison["rank_shift"] = comparison["corr_rank"] - comparison["MI_rank"]
comparison = comparison.sort_values("MI_rank")
print("Comparison: MI Rank vs. Correlation Rank")
print("=" * 75)
print(f"{'Feature':<20} {'MI Score':>10} {'MI Rank':>8} {'Corr':>10} {'Corr Rank':>10} {'Shift':>7}")
print("-" * 75)
for _, row in comparison.iterrows():
    shift_str = f"+{int(row['rank_shift'])}" if row['rank_shift'] >= 0 else str(int(row['rank_shift']))
marker = " <-- DISAGREE" if abs(row['rank_shift']) >= 2 else ""
print(f"{row['feature']:<20} {row['MI_score']:>10.4f} {int(row['MI_rank']):>8} "
f"{row['association']:>10.4f} {int(row['corr_rank']):>10} {shift_str:>7}{marker}")
Comparison: MI Rank vs. Correlation Rank
===========================================================================
Feature MI Score MI Rank Corr Corr Rank Shift
---------------------------------------------------------------------------
subscription_tier 0.0307 1 0.1578 1 +0
sessions_last_7d 0.0112 2 0.0951 2 +0
age 0.0091 3 0.0058 7 +4 <-- DISAGREE
days_since_signup 0.0072 4 0.0629 3 -1
avg_watch_pct 0.0054 5 0.0375 4 -1
primary_device 0.0042 6 0.0332 5 -1
primary_genre 0.0031 7 0.0289 6 -1
random_id_hash 0.0001 8 0.0014 8 +0
The standout disagreement is age, which shifts +4 positions (from rank 7 by correlation to rank 3 by MI). To understand why, we need to look at the shape of the relationship.
# Bin age and compute the empirical engagement rate per bin
age_bins = pd.cut(features["age"], bins=15)
age_engagement = pd.DataFrame({"age_bin": age_bins, "engaged": target})
rates = age_engagement.groupby("age_bin", observed=True)["engaged"].mean()
print("Engagement rate by age bin:")
for interval, rate in rates.items():
bar = "#" * int(rate * 50)
print(f" {str(interval):>15s}: {rate:.3f} {bar}")
Engagement rate by age bin:
(17.943, 21.8]: 0.554 ###########################
(21.8, 25.6]: 0.540 ###########################
(25.6, 29.4]: 0.515 #########################
(29.4, 33.2]: 0.494 ########################
(33.2, 37]: 0.474 #######################
(37, 40.8]: 0.467 #######################
(40.8, 44.6]: 0.461 #######################
(44.6, 48.4]: 0.468 #######################
(48.4, 52.2]: 0.480 ########################
(52.2, 56]: 0.497 ########################
(56, 59.8]: 0.520 ##########################
(59.8, 63.6]: 0.539 ##########################
(63.6, 67.4]: 0.558 ###########################
(67.4, 71.2]: 0.569 ############################
(71.2, 75]: 0.577 ############################
The relationship is U-shaped: engagement is highest for the youngest and oldest users, and lowest in the 37-48 age range. This cosine-like pattern has near-zero linear correlation (the positive and negative slopes cancel out), but it carries substantial mutual information because knowing a user's age tells you whether they are in a high-engagement or low-engagement age segment.
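Once the binned plot reveals the U-shape, a simple feature transform can hand the signal back to linear methods. The standalone sketch below (with a hypothetical vertex at age 46.5, read off the plot, and a simplified stand-in for the StreamRec generative process) folds age around the vertex and recomputes the correlation:

```python
import numpy as np

rng = np.random.RandomState(0)
age = rng.randint(18, 76, 50_000)
# Simplified stand-in: a U-shaped engagement effect in age plus noise
logit = np.cos((age - 18) * 2 * np.pi / 57) + rng.normal(0, 1, 50_000)
engaged = (logit > 0).astype(float)

r_raw = abs(np.corrcoef(age, engaged)[0, 1])
# Fold age around the (hypothetical) vertex of the U
r_folded = abs(np.corrcoef(np.abs(age - 46.5), engaged)[0, 1])
print(f"|r| raw: {r_raw:.3f}, |r| folded: {r_folded:.3f}")
```

Folding turns the symmetric U into a monotone relationship, so correlation recovers most of the dependence that MI detected in the raw feature.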
From-Scratch MI Implementation
To deepen understanding, here is a direct plug-in MI estimator using discretization:
def mutual_information_binned(
x: np.ndarray, y: np.ndarray,
n_bins: int = 20, categorical: bool = False
) -> float:
"""Estimate mutual information using histogram-based discretization.
For continuous x, discretizes into n_bins equal-frequency bins.
For categorical x, uses the categories directly.
Args:
x: Feature values (1D array).
y: Target values (1D array, discrete).
n_bins: Number of bins for continuous features.
categorical: If True, treat x as already discrete.
Returns:
Estimated mutual information in nats.
"""
    if not categorical:
        # Equal-frequency binning for continuous features
        percentiles = np.linspace(0, 100, n_bins + 1)
        bin_edges = np.percentile(x, percentiles)
        # Digitizing against the interior edges maps values into
        # bins 0 .. n_bins - 1, including the maximum value
        x_binned = np.digitize(x, bin_edges[1:-1])
    else:
        # Encode categories as integers
        _, x_binned = np.unique(x, return_inverse=True)
# Build the joint frequency table
y_values = np.unique(y)
x_values = np.unique(x_binned)
n = len(x)
mi = 0.0
for xi in x_values:
for yi in y_values:
# Joint probability
p_xy = np.sum((x_binned == xi) & (y == yi)) / n
# Marginals
p_x = np.sum(x_binned == xi) / n
p_y = np.sum(y == yi) / n
if p_xy > 0 and p_x > 0 and p_y > 0:
mi += p_xy * np.log(p_xy / (p_x * p_y))
return mi
# Compare from-scratch vs. scikit-learn
print("MI Comparison: From-Scratch vs. scikit-learn")
print("=" * 55)
print(f"{'Feature':<20} {'Scratch':>10} {'sklearn':>10}")
print("-" * 55)
features_encoded = features.copy()
cat_cols = features.select_dtypes(include=["object"]).columns
for col in cat_cols:
features_encoded[col] = features_encoded[col].astype("category").cat.codes
mi_sklearn = mutual_info_classif(
features_encoded.values, target,
discrete_features=[col in cat_cols for col in features_encoded.columns],
random_state=42, n_neighbors=5
)
for i, col in enumerate(features.columns):
is_cat = col in cat_cols
mi_scratch = mutual_information_binned(
features[col].values if not is_cat else features_encoded[col].values,
target,
n_bins=20,
categorical=is_cat
)
print(f"{col:<20} {mi_scratch:>10.4f} {mi_sklearn[i]:>10.4f}")
MI Comparison: From-Scratch vs. scikit-learn
=======================================================
Feature Scratch sklearn
-------------------------------------------------------
age 0.0092 0.0091
days_since_signup 0.0068 0.0072
sessions_last_7d 0.0118 0.0112
avg_watch_pct 0.0049 0.0054
subscription_tier 0.0305 0.0307
primary_genre 0.0029 0.0031
primary_device 0.0038 0.0042
random_id_hash 0.0002 0.0001
The from-scratch implementation produces estimates consistent with scikit-learn. Small differences arise because the kNN estimator (scikit-learn) and the binning estimator (from scratch) use different density estimation strategies.
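One concrete difference worth knowing: the plug-in (binned) estimator is biased upward, and for independent variables the bias grows with the number of bins, roughly (bins − 1)(classes − 1)/(2n) nats (the Miller-Madow correction term). A compact standalone check, using a self-contained version of the binned estimator:

```python
import numpy as np

def binned_mi(x: np.ndarray, y: np.ndarray, n_bins: int) -> float:
    """Plug-in MI with equal-frequency binning (compact standalone version)."""
    edges = np.percentile(x, np.linspace(0, 100, n_bins + 1))[1:-1]
    x_binned = np.digitize(x, edges)          # values in 0 .. n_bins - 1
    joint = np.zeros((n_bins, 2))
    np.add.at(joint, (x_binned, y), 1)        # joint frequency table
    joint /= len(x)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px * py)[nz])).sum())

rng = np.random.RandomState(0)
x = rng.normal(size=5_000)
y = rng.randint(0, 2, 5_000)  # independent of x: true MI is zero
for bins in (5, 20, 100):
    print(f"bins={bins:3d}  estimated MI: {binned_mi(x, y, bins):.4f}")
```

The estimate for independent data creeps upward as bins increase, which is why the bin count is a real tuning parameter and why near-zero MI scores should not be over-interpreted.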
Practical Recommendations
- Use MI as a complement to correlation, not a replacement. Correlation is fast, interpretable, and well understood. MI is strictly more general but harder to interpret (what does "0.03 nats" mean in business terms?) and more sensitive to estimation parameters (number of bins, number of neighbors).
- Always check for nonlinear relationships. When a feature has low correlation but high MI, create a binned engagement plot (as we did for age) to understand the shape of the relationship. This diagnostic takes 30 seconds and prevents you from dropping a genuinely informative feature.
- For categorical features, MI is often more natural than correlation. Pearson correlation is not meaningful for unordered categorical variables. Cramér's V works but loses information about which categories drive the association. MI captures the full dependency structure.
- In production, combine MI-based screening with model-based importance. MI measures univariate associations; it does not capture feature interactions. A practical pipeline: (a) drop features with near-zero MI (screening), (b) train the model on remaining features, (c) use SHAP or permutation importance for the final selection. This three-stage approach catches nonlinear univariate effects (stage a) and multivariate interactions (stage c).
Production Reality: StreamRec's production feature selection pipeline uses MI as a first-pass filter (removing features below a 0.001 nat threshold), followed by a gradient boosted tree model with permutation importance for final ranking. The MI filter catches one to three features per quarter that correlation-based screening would miss — typically categorical features with nonlinear effects. At StreamRec's scale (47 features, 2M examples), the MI computation adds approximately 45 seconds to the pipeline — a negligible cost for a weekly batch job.