Case Study 1: Feature Selection via Mutual Information at StreamRec

Context

StreamRec's recommendation team is building a new content ranking model. The first step is feature selection: from the 47 available user features, which ones carry the most predictive information about content engagement?

The team's junior data scientist, trained on the standard scikit-learn pipeline, proposes ranking features by their absolute Pearson correlation with the target variable (binary engagement: clicked/not-clicked). The senior data scientist pushes back: "Correlation misses nonlinear relationships. We should use mutual information."

This case study walks through both approaches, identifies where they diverge, implements MI from scratch and with scikit-learn, and draws conclusions about when each method is appropriate.

The Data

StreamRec has 2 million user-content interaction records. Each record has user features, content features, and a binary engagement label. For simplicity, we work with a 200,000-record sample and focus on a subset of 8 representative user-side features spanning continuous, ordinal, and categorical types.

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from scipy.stats import chi2_contingency, spearmanr
from typing import Dict, List, Tuple

def generate_streamrec_features(n: int = 200000, seed: int = 42) -> Tuple[pd.DataFrame, np.ndarray]:
    """Generate realistic StreamRec user features with known nonlinear effects.

    The data-generating process is designed to include:
    - Linear effects (session count)
    - Threshold effects (subscription tier)
    - U-shaped effects (age)
    - Interaction effects (genre x device)
    - Pure noise features

    Returns:
        Tuple of (feature DataFrame, binary engagement labels).
    """
    rng = np.random.RandomState(seed)

    df = pd.DataFrame({
        # Continuous features
        "age": rng.randint(18, 76, n),  # ages 18-75 inclusive
        "days_since_signup": rng.exponential(400, n).astype(int),
        "sessions_last_7d": rng.poisson(6, n),
        "avg_watch_pct": np.clip(rng.beta(2, 3, n) * 100, 0, 100),

        # Categorical features
        "subscription_tier": rng.choice(
            ["free", "basic", "premium"], n, p=[0.50, 0.35, 0.15]
        ),
        "primary_genre": rng.choice(
            ["drama", "comedy", "documentary", "action", "sci-fi"],
            n, p=[0.30, 0.25, 0.20, 0.15, 0.10]
        ),
        "primary_device": rng.choice(
            ["mobile", "desktop", "smart_tv", "tablet"],
            n, p=[0.50, 0.25, 0.15, 0.10]
        ),

        # Noise feature (should rank last)
        "random_id_hash": rng.randint(0, 1000, n),
    })

    # Encode for the generative model
    tier_num = df["subscription_tier"].map({"free": 0, "basic": 1, "premium": 2}).values
    genre_num = df["primary_genre"].map({
        "drama": 0, "comedy": 1, "documentary": 2, "action": 3, "sci-fi": 4
    }).values

    # Engagement logit with intentional nonlinear effects
    logit = (
        # Linear: more sessions -> more engagement
        0.08 * df["sessions_last_7d"].values

        # Threshold: premium gets a jump, basic gets a smaller jump
        + 0.9 * (tier_num == 2) + 0.3 * (tier_num == 1)

        # U-shaped: young (<25) and older (>55) users more engaged
        + 0.5 * np.cos((df["age"].values - 18) * 2 * np.pi / 57)

        # Saturating: signup age matters, but plateaus after 6 months
        + 0.3 * np.tanh(df["days_since_signup"].values / 200)

        # Interaction: documentary watchers on desktop more engaged
        + 0.7 * ((genre_num == 2) &
                  (df["primary_device"] == "desktop")).astype(float)

        # Watch percentage: strong predictor but with diminishing returns
        + 0.02 * np.sqrt(df["avg_watch_pct"].values)

        # Noise: random_id_hash has NO true effect
        + 0.0 * df["random_id_hash"].values

        # Base rate and noise
        - 0.8 + rng.normal(0, 1.0, n)
    )

    engaged = (logit > 0).astype(int)
    return df, engaged

# Generate data
features, target = generate_streamrec_features(n=200000)
print(f"Samples: {len(features)}, Engagement rate: {target.mean():.3f}")
print(f"Features: {list(features.columns)}")
Samples: 200000, Engagement rate: 0.506
Features: ['age', 'days_since_signup', 'sessions_last_7d', 'avg_watch_pct', 'subscription_tier', 'primary_genre', 'primary_device', 'random_id_hash']

Approach 1: Correlation-Based Ranking

For continuous features, we use the absolute Pearson correlation. For categorical features, we use Cramér's V, which generalizes the phi coefficient to multi-category variables.

def correlation_ranking(features: pd.DataFrame,
                         target: np.ndarray) -> pd.DataFrame:
    """Rank features by linear association with the target.

    Uses absolute Pearson correlation for continuous features
    and Cramér's V for categorical features.
    """
    results = []
    for col in features.columns:
        if features[col].dtype == "object":
            # Cramér's V for categorical features
            contingency = pd.crosstab(features[col], target)
            chi2, _, _, _ = chi2_contingency(contingency)
            n = len(target)
            min_dim = min(contingency.shape) - 1
            v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else 0
            results.append({"feature": col, "association": v, "method": "Cramér's V"})
        else:
            # Absolute Pearson correlation for continuous features
            corr = np.abs(np.corrcoef(features[col].values.astype(float), target)[0, 1])
            results.append({"feature": col, "association": corr, "method": "|Pearson r|"})

    df = pd.DataFrame(results)
    df["corr_rank"] = df["association"].rank(ascending=False).astype(int)
    return df.sort_values("corr_rank")

corr_results = correlation_ranking(features, target)
print("Correlation-Based Feature Ranking")
print("=" * 55)
print(corr_results.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
Correlation-Based Feature Ranking
=======================================================
           feature  association       method  corr_rank
 subscription_tier       0.1578   Cramér's V          1
  sessions_last_7d       0.0951  |Pearson r|          2
 days_since_signup       0.0629  |Pearson r|          3
     avg_watch_pct       0.0375  |Pearson r|          4
    primary_device       0.0332   Cramér's V          5
     primary_genre       0.0289   Cramér's V          6
               age       0.0058  |Pearson r|          7
    random_id_hash       0.0014  |Pearson r|          8

Age ranks 7th out of 8 — barely above the pure noise feature. Correlation sees almost no linear relationship between age and engagement.

Approach 2: Mutual Information-Based Ranking

MI captures both linear and nonlinear dependencies. We use scikit-learn's mutual_info_classif, which employs the Kraskov-Stögbauer-Grassberger (KSG) nearest-neighbor estimator for continuous features and a direct plug-in estimator for discrete features.

def mi_ranking(features: pd.DataFrame,
               target: np.ndarray) -> pd.DataFrame:
    """Rank features by mutual information with the target."""

    # Encode categoricals as integer codes
    features_encoded = features.copy()
    categorical_cols = features.select_dtypes(include=["object"]).columns
    for col in categorical_cols:
        features_encoded[col] = features_encoded[col].astype("category").cat.codes

    # Identify discrete features
    discrete_mask = [col in categorical_cols for col in features_encoded.columns]

    # Compute MI
    mi_scores = mutual_info_classif(
        features_encoded.values, target,
        discrete_features=discrete_mask,
        random_state=42,
        n_neighbors=5
    )

    df = pd.DataFrame({
        "feature": features.columns,
        "MI_score": mi_scores,
    })
    df["MI_rank"] = df["MI_score"].rank(ascending=False).astype(int)
    return df.sort_values("MI_rank")

mi_results = mi_ranking(features, target)
print("Mutual Information-Based Feature Ranking")
print("=" * 45)
print(mi_results.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
Mutual Information-Based Feature Ranking
=============================================
           feature  MI_score  MI_rank
 subscription_tier    0.0307        1
  sessions_last_7d    0.0112        2
               age    0.0091        3
 days_since_signup    0.0072        4
     avg_watch_pct    0.0054        5
    primary_device    0.0042        6
     primary_genre    0.0031        7
    random_id_hash    0.0001        8

Age jumps from rank 7 (correlation) to rank 3 (MI). This is the key disagreement.

Where the Rankings Disagree — and Why

# Merge and compare
comparison = corr_results[["feature", "association", "corr_rank"]].merge(
    mi_results[["feature", "MI_score", "MI_rank"]], on="feature"
)
comparison["rank_shift"] = comparison["corr_rank"] - comparison["MI_rank"]
comparison = comparison.sort_values("MI_rank")

print("Comparison: MI Rank vs. Correlation Rank")
print("=" * 75)
print(f"{'Feature':<20} {'MI Score':>10} {'MI Rank':>8} {'Corr':>10} {'Corr Rank':>10} {'Shift':>7}")
print("-" * 75)
for _, row in comparison.iterrows():
    shift_str = f"{int(row['rank_shift']):+d}"  # always signed: "+4", "-1", "+0"
    marker = " <-- DISAGREE" if abs(row['rank_shift']) >= 2 else ""
    print(f"{row['feature']:<20} {row['MI_score']:>10.4f} {int(row['MI_rank']):>8} "
          f"{row['association']:>10.4f} {int(row['corr_rank']):>10} {shift_str:>7}{marker}")
Comparison: MI Rank vs. Correlation Rank
===========================================================================
Feature              MI Score  MI Rank       Corr  Corr Rank   Shift
---------------------------------------------------------------------------
subscription_tier      0.0307        1     0.1578          1      +0
sessions_last_7d       0.0112        2     0.0951          2      +0
age                    0.0091        3     0.0058          7      +4 <-- DISAGREE
days_since_signup      0.0072        4     0.0629          3      -1
avg_watch_pct          0.0054        5     0.0375          4      -1
primary_device         0.0042        6     0.0332          5      -1
primary_genre          0.0031        7     0.0289          6      -1
random_id_hash         0.0001        8     0.0014          8      +0

The standout disagreement is age, which shifts +4 positions (from rank 7 by correlation to rank 3 by MI). To understand why, we need to look at the shape of the relationship.

# Bin age and compute engagement rate
age_bins = pd.cut(features["age"], bins=15)
age_engagement = pd.DataFrame({"age_bin": age_bins, "engaged": target})
rates = age_engagement.groupby("age_bin", observed=True)["engaged"].mean()

print("Engagement rate by age bin:")
for interval, rate in rates.items():
    bar = "#" * int(rate * 50)
    print(f"  {str(interval):>15s}: {rate:.3f} {bar}")
Engagement rate by age bin:
  (17.943, 21.8]: 0.554 ###########################
   (21.8, 25.6]: 0.540 ###########################
   (25.6, 29.4]: 0.515 #########################
   (29.4, 33.2]: 0.494 ########################
     (33.2, 37]: 0.474 #######################
     (37, 40.8]: 0.467 #######################
   (40.8, 44.6]: 0.461 #######################
   (44.6, 48.4]: 0.468 #######################
   (48.4, 52.2]: 0.480 ########################
     (52.2, 56]: 0.497 ########################
     (56, 59.8]: 0.520 ##########################
   (59.8, 63.6]: 0.539 ##########################
   (63.6, 67.4]: 0.558 ###########################
   (67.4, 71.2]: 0.569 ############################
     (71.2, 75]: 0.577 ############################

The relationship is U-shaped: engagement is highest for the youngest and oldest users, and lowest in the 37-48 age range. This cosine-like pattern has near-zero linear correlation (the positive and negative slopes cancel out), but it carries substantial mutual information because knowing a user's age tells you whether they are in a high-engagement or low-engagement age segment.
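The effect is easy to reproduce in isolation. The sketch below uses synthetic data with an assumed cosine-shaped engagement probability (the parameters are illustrative, not StreamRec's): correlation scores the relationship near zero while MI clearly detects it.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
age = rng.uniform(18, 75, 30000)

# Engagement probability is U-shaped in age: high at both extremes,
# lowest in mid-life (illustrative parameters)
p = 0.45 + 0.12 * np.cos((age - 18) * 2 * np.pi / 57)
engaged = (rng.uniform(size=age.size) < p).astype(int)

r = np.corrcoef(age, engaged)[0, 1]
mi = mutual_info_classif(age.reshape(-1, 1), engaged, random_state=0)[0]
print(f"|Pearson r|: {abs(r):.4f}")  # near zero: the rising and falling halves cancel
print(f"MI (nats):   {mi:.4f}")      # clearly positive
```

The symmetric rise and fall leave no net linear trend for Pearson to find, while the kNN estimator picks up how the conditional engagement probability varies across the age range.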

From-Scratch MI Implementation

To deepen understanding, here is a direct plug-in MI estimator using discretization:

def mutual_information_binned(
    x: np.ndarray, y: np.ndarray,
    n_bins: int = 20, categorical: bool = False
) -> float:
    """Estimate mutual information using histogram-based discretization.

    For continuous x, discretizes into n_bins equal-frequency bins.
    For categorical x, uses the categories directly.

    Args:
        x: Feature values (1D array).
        y: Target values (1D array, discrete).
        n_bins: Number of bins for continuous features.
        categorical: If True, treat x as already discrete.

    Returns:
        Estimated mutual information in nats.
    """
    if not categorical:
        # Equal-frequency binning: digitize against interior edges only,
        # so the minimum and maximum fall into the first and last bins
        percentiles = np.linspace(0, 100, n_bins + 1)
        bin_edges = np.percentile(x, percentiles)
        x_binned = np.digitize(x, bin_edges[1:-1])
    else:
        # Encode categories as integer codes
        _, x_binned = np.unique(x, return_inverse=True)

    # Build the joint frequency table
    y_values = np.unique(y)
    x_values = np.unique(x_binned)

    n = len(x)
    mi = 0.0

    for xi in x_values:
        for yi in y_values:
            # Joint probability
            p_xy = np.sum((x_binned == xi) & (y == yi)) / n
            # Marginals
            p_x = np.sum(x_binned == xi) / n
            p_y = np.sum(y == yi) / n

            if p_xy > 0 and p_x > 0 and p_y > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))

    return mi

# Compare from-scratch vs. scikit-learn
print("MI Comparison: From-Scratch vs. scikit-learn")
print("=" * 55)
print(f"{'Feature':<20} {'Scratch':>10} {'sklearn':>10}")
print("-" * 55)

features_encoded = features.copy()
cat_cols = features.select_dtypes(include=["object"]).columns
for col in cat_cols:
    features_encoded[col] = features_encoded[col].astype("category").cat.codes

mi_sklearn = mutual_info_classif(
    features_encoded.values, target,
    discrete_features=[col in cat_cols for col in features_encoded.columns],
    random_state=42, n_neighbors=5
)

for i, col in enumerate(features.columns):
    is_cat = col in cat_cols
    mi_scratch = mutual_information_binned(
        features[col].values if not is_cat else features_encoded[col].values,
        target,
        n_bins=20,
        categorical=is_cat
    )
    print(f"{col:<20} {mi_scratch:>10.4f} {mi_sklearn[i]:>10.4f}")
MI Comparison: From-Scratch vs. scikit-learn
=======================================================
Feature               Scratch    sklearn
-------------------------------------------------------
age                    0.0092     0.0091
days_since_signup      0.0068     0.0072
sessions_last_7d       0.0118     0.0112
avg_watch_pct          0.0049     0.0054
subscription_tier      0.0305     0.0307
primary_genre          0.0029     0.0031
primary_device         0.0038     0.0042
random_id_hash         0.0002     0.0001

The from-scratch implementation produces estimates consistent with scikit-learn. Small differences arise because the kNN estimator (scikit-learn) and the binning estimator (from scratch) use different density estimation strategies.
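The binning estimator's sensitivity to its main parameter is easy to see directly. This sketch (synthetic data; the bin counts are arbitrary) re-implements the plug-in estimator compactly and varies n_bins on a fixed dataset — the estimate drifts upward as bins multiply, a known finite-sample bias of histogram MI:

```python
import numpy as np

def binned_mi(x: np.ndarray, y: np.ndarray, n_bins: int) -> float:
    """Plug-in MI (nats) between continuous x and discrete y,
    using equal-frequency binning of x."""
    edges = np.percentile(x, np.linspace(0, 100, n_bins + 1))
    x_binned = np.digitize(x, edges[1:-1])
    mi = 0.0
    for xi in np.unique(x_binned):
        for yi in np.unique(y):
            p_xy = np.mean((x_binned == xi) & (y == yi))
            p_x, p_y = np.mean(x_binned == xi), np.mean(y == yi)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, 50000)
# U-shaped dependency: the label is likelier at the extremes of x
y = (np.abs(x - 0.5) + rng.normal(0, 0.15, x.size) > 0.3).astype(int)

for b in (5, 20, 100, 500):
    print(f"n_bins={b:>3}: MI = {binned_mi(x, y, b):.4f}")
```

With too few bins the estimator blurs the shape of the dependency; with very many, each bin holds so few samples that random fluctuations masquerade as information. Equal-frequency binning with roughly 10-30 bins is a reasonable default at this sample size.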

Practical Recommendations

  1. Use MI as a complement to correlation, not a replacement. Correlation is fast, interpretable, and well-understood. MI is strictly more general but harder to interpret (what does "0.03 nats" mean in business terms?) and more sensitive to estimation parameters (number of bins, number of neighbors).

  2. Always check for nonlinear relationships. When a feature has low correlation but high MI, create a binned engagement plot (as we did for age) to understand the shape of the relationship. This diagnostic takes 30 seconds and prevents you from dropping a genuinely informative feature.

  3. For categorical features, MI is often more natural than correlation. Pearson correlation is not meaningful for unordered categorical variables. Cramér's V works but loses information about which categories drive the association. MI captures the full dependency structure.

  4. In production, combine MI-based screening with model-based importance. MI measures univariate associations; it does not capture feature interactions. A practical pipeline: (a) drop features with near-zero MI (screening), (b) train the model on remaining features, (c) use SHAP or permutation importance for the final selection. This three-stage approach catches nonlinear univariate effects (stage a) and multivariate interactions (stage c).
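The three-stage pipeline from point 4 can be sketched end to end. This is a minimal illustration on synthetic data; the 0.001-nat threshold, model choice, and feature construction are assumptions for the example, not StreamRec's production configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 10000
X = rng.normal(size=(n, 5))
# Label depends on a U-shape in feature 0 and linearly on feature 1;
# features 2-4 are pure noise
y = ((X[:, 0] ** 2 - 1.0) + X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)

# Stage (a): MI screening drops features with near-zero MI
mi = mutual_info_classif(X, y, random_state=0)
keep = mi > 0.001  # illustrative threshold
X_kept = X[:, keep]

# Stage (b): train the model on the surviving features
X_tr, X_te, y_tr, y_te = train_test_split(X_kept, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Stage (c): permutation importance on held-out data for the final ranking
imp = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
print("kept feature indices:", np.flatnonzero(keep))
print("permutation importances:", np.round(imp.importances_mean, 4))
```

A correlation filter at stage (a) would likely discard feature 0, whose U-shaped effect has near-zero Pearson correlation with the label; the MI filter keeps it, and permutation importance then confirms its contribution on held-out data.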

Production Reality: StreamRec's production feature selection pipeline uses MI as a first-pass filter (removing features below a 0.001 nat threshold), followed by a gradient boosted tree model with permutation importance for final ranking. The MI filter catches one to three features per quarter that correlation-based screening would miss — typically categorical features with nonlinear effects. At StreamRec's scale (47 features, 2M examples), the MI computation adds approximately 45 seconds to the pipeline — a negligible cost for a weekly batch job.