Case Study 1: StreamFlow --- When 127 Features Became 14


Background

StreamFlow's data science team has been building toward this moment for three months. The SQL extraction pipeline (Chapter 5) pulls subscriber behavior from the data warehouse. The feature engineering pipeline (Chapter 6) transforms raw events into 25 base features. Categorical encoding (Chapter 7) adds 8 more. Missing data handling (Chapter 8) adds 6 missing indicators and fills gaps. Interaction terms and polynomial features push the count higher.

The total: 127 features for churn prediction.

The model --- a gradient boosted tree with 200 estimators --- trains in 47 minutes on the full 2.4 million subscriber-month dataset. It achieves an AUC of 0.836 on the holdout set. The retention team is satisfied. The engineering team is not.

"How are we supposed to monitor 127 features in production?" asks the ML engineer responsible for deployment. "Every feature is a dependency. Every dependency can break. Every break requires a page, an investigation, and a fix. With 127 features, something will break every week."

She is right. Feature selection is not just a modeling improvement --- it is a deployment requirement.


The Data

The 127 features break down into categories:

Category                                 Count  Examples
---------------------------------------  -----  ------------------------------------------------------------
Usage metrics (7d, 30d, 90d windows)        18  hours_last_7d, sessions_last_30d, avg_session_duration_90d
Engagement ratios                           12  hours_per_tenure_month, sessions_per_device, genre_diversity
Temporal trends                              8  engagement_trend_7d_30d, hours_trend_30d_90d
Account features                             6  tenure_months, monthly_charge, plan_type_encoded
Behavioral indicators                        9  used_mobile_app, used_search, used_recommendations
Communication features                       5  email_open_rate, email_click_rate, nps_score
Missing indicators                           6  hours_7d_missing, email_open_rate_missing, nps_missing
Interaction terms                           15  tenure_x_charge, sessions_x_genre_diversity
Polynomial features                         12  hours_7d_squared, tenure_squared
Categorical encodings                        8  plan_type_basic, plan_type_premium, device_os_ios
Social features                              4  referred_a_friend, shared_content_count
Support features                             3  support_tickets_90d, avg_resolution_time, escalation_flag
Seasonal features                            6  signup_month_sin, signup_month_cos, is_holiday_signup
Customer lifetime value proxies              5  total_revenue, avg_monthly_spend, discount_count
Content preference features                 10  top_genre_drama, genre_count, binge_session_count
---------------------------------------  -----  ------------------------------------------------------------
Total                                      127

Phase 1: Filter Screening

The team starts with the fastest methods to eliminate obvious noise and redundancy.

Variance Threshold

from sklearn.feature_selection import VarianceThreshold

# Variance screening runs on a 50K-row sample of the full feature
# matrix (127 features, 2.4M rows). In production the matrix is
# loaded from the feature store; the variances below summarize the
# features flagged in that run.
variance_results = {
    'is_holiday_signup': 0.0023,       # 99.8% are 0
    'signup_month_cos': 0.2514,
    'signup_month_sin': 0.2487,
    'plan_type_family': 0.0098,        # 99% are 0
    'device_os_linux': 0.0034,         # 99.7% are 0
    'escalation_flag': 0.0087,         # 99.1% are 0
}

selector = VarianceThreshold(threshold=0.01)
# selector.fit(X)

print("Features removed by VarianceThreshold (var < 0.01):")
print("=" * 55)
for feat, var in sorted(variance_results.items(), key=lambda x: x[1]):
    if var < 0.01:
        print(f"  {feat:30s}  var = {var:.4f}")

print(f"\nRemoved: 4 features")
print(f"Remaining: 123 features")
Features removed by VarianceThreshold (var < 0.01):
=======================================================
  is_holiday_signup               var = 0.0023
  device_os_linux                 var = 0.0034
  escalation_flag                 var = 0.0087
  plan_type_family                var = 0.0098

Removed: 4 features
Remaining: 123 features
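
The run above is summarized rather than executed. A minimal, runnable sketch of the same mechanics on a toy stand-in (the real matrix comes from the feature store; the two columns and rates here are purely illustrative):

import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy stand-in: one informative feature, one near-constant flag
rng = np.random.default_rng(42)
X = pd.DataFrame({
    'hours_last_7d': rng.exponential(2.0, size=10_000),
    'is_holiday_signup': (rng.random(10_000) < 0.002).astype(float),  # ~99.8% zeros
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(X)

kept = X.columns[selector.get_support()].tolist()
dropped = X.columns[~selector.get_support()].tolist()
print(f"Kept:    {kept}")     # ['hours_last_7d']
print(f"Dropped: {dropped}")  # ['is_holiday_signup']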

Correlation-Based Redundancy Removal

# Correlation matrix computed on the 123 features that survived the
# variance screen; pairs with |r| > 0.85 are flagged (see the sketch
# after the output for the mechanics)

high_corr_pairs = [
    ('hours_last_7d', 'sessions_last_7d', 0.94),
    ('hours_last_7d', 'hours_last_30d', 0.92),
    ('hours_last_30d', 'hours_last_90d', 0.96),
    ('sessions_last_7d', 'sessions_last_30d', 0.91),
    ('sessions_last_30d', 'sessions_last_90d', 0.95),
    ('hours_last_7d', 'avg_session_duration_7d', 0.88),
    ('hours_7d_squared', 'hours_last_7d', 0.97),
    ('tenure_squared', 'tenure_months', 0.95),
    ('email_open_rate', 'email_click_rate', 0.87),
    ('engagement_trend_7d_30d', 'hours_trend_7d_30d', 0.93),
    ('engagement_trend_30d_90d', 'hours_trend_30d_90d', 0.91),
    ('total_revenue', 'tenure_months', 0.89),
    ('total_revenue', 'avg_monthly_spend', 0.86),
    ('monthly_charge', 'avg_monthly_spend', 0.92),
]

print(f"Highly correlated pairs (|r| > 0.85): {len(high_corr_pairs)}")
print("=" * 70)
for f1, f2, corr in sorted(high_corr_pairs, key=lambda x: x[2], reverse=True):
    print(f"  {f1:35s} <-> {f2:35s}  |r| = {corr:.2f}")

# Decision: from each correlated cluster, keep the feature with the
# highest univariate correlation with the churn target; the inline
# notes record the business rationale for each drop
features_to_drop_corr = [
    'hours_last_30d',           # keep hours_last_7d (more recent)
    'hours_last_90d',           # redundant with 30d and 7d
    'sessions_last_30d',        # keep sessions_last_7d
    'sessions_last_90d',        # redundant
    'avg_session_duration_7d',  # derivable from hours and sessions
    'hours_7d_squared',         # polynomial of a kept feature
    'tenure_squared',           # polynomial of a kept feature
    'email_click_rate',         # keep email_open_rate (more complete)
    'hours_trend_7d_30d',       # keep engagement_trend
    'hours_trend_30d_90d',      # redundant with engagement trend
    'avg_monthly_spend',        # keep monthly_charge (more direct)
    'total_revenue',            # highly correlated with tenure x charge
]

print(f"\nFeatures removed for redundancy: {len(features_to_drop_corr)}")
print(f"Remaining: {123 - len(features_to_drop_corr)} features")
Highly correlated pairs (|r| > 0.85): 14
======================================================================
  hours_7d_squared                  <-> hours_last_7d                    |r| = 0.97
  hours_last_30d                    <-> hours_last_90d                   |r| = 0.96
  sessions_last_30d                 <-> sessions_last_90d                |r| = 0.95
  tenure_squared                    <-> tenure_months                    |r| = 0.95
  hours_last_7d                     <-> sessions_last_7d                 |r| = 0.94
  engagement_trend_7d_30d           <-> hours_trend_7d_30d               |r| = 0.93
  hours_last_7d                     <-> hours_last_30d                   |r| = 0.92
  monthly_charge                    <-> avg_monthly_spend                |r| = 0.92
  sessions_last_7d                  <-> sessions_last_30d                |r| = 0.91
  engagement_trend_30d_90d          <-> hours_trend_30d_90d              |r| = 0.91
  total_revenue                     <-> tenure_months                    |r| = 0.89
  hours_last_7d                     <-> avg_session_duration_7d          |r| = 0.88
  email_open_rate                   <-> email_click_rate                 |r| = 0.87
  total_revenue                     <-> avg_monthly_spend                |r| = 0.86

Features removed for redundancy: 12
Remaining: 111 features
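
The pair list above was transcribed from the team's run. A sketch of how such pairs can be extracted programmatically, shown on a three-column synthetic stand-in rather than the real 123-column matrix:

import numpy as np
import pandas as pd

# Synthetic stand-in with one deliberately correlated pair
rng = np.random.default_rng(0)
base = rng.normal(size=5_000)
X = pd.DataFrame({
    'hours_last_7d': base,
    'hours_last_30d': base + 0.3 * rng.normal(size=5_000),  # r ~ 0.96
    'genre_diversity': rng.normal(size=5_000),
})

corr = X.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().loc[lambda s: s > 0.85].sort_values(ascending=False)

for (f1, f2), r in pairs.items():
    print(f"{f1:20s} <-> {f2:20s}  |r| = {r:.2f}")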

VIF Analysis

# VIF computed on the 111 remaining features, on a row sample for
# speed (see the sketch after the output for the mechanics)

# Summary of VIF results after correlation-based removal
vif_flagged = [
    ('sessions_last_7d', 8.7),
    ('avg_session_duration_30d', 7.2),
    ('avg_session_duration_90d', 6.9),
    ('engagement_trend_30d_90d', 5.4),
    ('signup_month_sin', 5.1),
    ('signup_month_cos', 5.3),
]

print("Features with VIF > 5 (after correlation-based removal):")
print("=" * 55)
for feat, vif in sorted(vif_flagged, key=lambda x: x[1], reverse=True):
    label = "BIG PROBLEM" if vif > 10 else "Problem"
    print(f"  {feat:35s}  VIF = {vif:.1f}  {label}")

features_to_drop_vif = [
    'avg_session_duration_30d',   # Derivable from hours and sessions (like the dropped 7d version)
    'avg_session_duration_90d',   # Same
    'signup_month_sin',           # Low predictive value, collinear with cos
]

print(f"\nFeatures removed by VIF: {len(features_to_drop_vif)}")
print(f"Remaining: {111 - len(features_to_drop_vif)} features")
Features with VIF > 5 (after correlation-based removal):
=======================================================
  sessions_last_7d                   VIF = 8.7  Problem
  avg_session_duration_30d           VIF = 7.2  Problem
  avg_session_duration_90d           VIF = 6.9  Problem
  engagement_trend_30d_90d           VIF = 5.4  Problem
  signup_month_cos                   VIF = 5.3  Problem
  signup_month_sin                   VIF = 5.1  Problem

Features removed by VIF: 3
Remaining: 108 features
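
A sketch of the VIF pass itself, on a synthetic stand-in with deliberate collinearity (the real computation runs on a row sample of the 111 remaining features; the column mix here is illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in: 'sessions' is mostly a rescaled copy of 'hours'
rng = np.random.default_rng(0)
z = rng.normal(size=(5_000, 3))
X = pd.DataFrame({
    'sessions_last_7d': 2.0 * z[:, 0] + 0.5 * z[:, 1],  # collinear mix
    'hours_last_7d': z[:, 0],
    'tenure_months': z[:, 2],
})

scaled = StandardScaler().fit_transform(X)
vif = pd.Series(
    [variance_inflation_factor(scaled, i) for i in range(scaled.shape[1])],
    index=X.columns,
).sort_values(ascending=False)

print(vif.round(1))
print("Flagged (VIF > 5):", vif[vif > 5].index.tolist())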

Filter screening summary: 127 features reduced to 108 in under 5 minutes of compute time. Nineteen features removed --- 4 for near-zero variance, 12 for redundancy, 3 for multicollinearity.


Phase 2: Embedded Selection

With 108 features remaining, the team applies L1 regularization to identify the features that survive penalized fitting.

# L1-penalized logistic regression with cross-validated selection of C,
# sweeping C over 0.001, 0.01, 0.1, 1.0 (see the sketch after the output
# for how the sweep is run)

results_by_c = {
    0.001: {'nonzero': 6, 'auc': 0.798},
    0.01: {'nonzero': 14, 'auc': 0.827},
    0.1: {'nonzero': 31, 'auc': 0.839},
    1.0: {'nonzero': 52, 'auc': 0.835},
}

print("L1 Feature Selection: Regularization Sweep")
print("=" * 60)
print(f"  {'C':>8s}  {'Non-zero':>10s}  {'CV AUC':>8s}")
print("-" * 60)
for c, res in results_by_c.items():
    print(f"  {c:8.3f}  {res['nonzero']:10d}  {res['auc']:8.3f}")

print(f"\nBest AUC at C=0.1 with 31 features.")
print(f"But C=0.01 with 14 features is within 1.2 AUC points.")
L1 Feature Selection: Regularization Sweep
============================================================
         C    Non-zero    CV AUC
------------------------------------------------------------
     0.001           6     0.798
     0.010          14     0.827
     0.100          31     0.839
     1.000          52     0.835

Best AUC at C=0.1 with 31 features.
But C=0.01 with 14 features is within 1.2 AUC points.
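
A sketch of how the sweep can be run end to end. The data come from make_classification as a stand-in for the 108-feature matrix, and the solver and max_iter settings are assumptions, not the team's recorded configuration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 108 features, only a handful informative
X, y = make_classification(n_samples=5_000, n_features=108,
                           n_informative=14, random_state=42)

for C in [0.001, 0.01, 0.1, 1.0]:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty='l1', solver='liblinear', C=C, max_iter=1000),
    )
    auc = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()
    model.fit(X, y)
    n_nonzero = int(np.sum(model[-1].coef_ != 0))  # features surviving the penalty
    print(f"C = {C:<6}  non-zero coefs = {n_nonzero:3d}  CV AUC = {auc:.3f}")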

The team examines the 14 features selected at C=0.01:

# Features selected by L1 at C=0.01 (14 features)
selected_features = [
    ('genre_diversity', -0.4812, 'Content engagement breadth'),
    ('support_tickets_90d', 0.3567, 'Customer frustration signal'),
    ('days_since_last_login', 0.3234, 'Recency of engagement'),
    ('hours_7d_missing', 0.2987, 'Missing usage = disengagement signal'),
    ('email_open_rate', -0.2456, 'Communication responsiveness'),
    ('engagement_trend_7d_30d', 0.2178, 'Deteriorating engagement'),
    ('sessions_last_7d', -0.1923, 'Recent activity volume'),
    ('referred_a_friend', -0.1756, 'Social commitment signal'),
    ('nps_missing', 0.1534, 'Survey non-response = disengagement'),
    ('plan_is_premium', -0.1345, 'Higher plan = higher switching cost'),
    ('tenure_months', -0.1123, 'Longer tenure = more habitual'),
    ('monthly_charge', 0.0945, 'Price sensitivity signal'),
    ('completion_rate', -0.0834, 'Content satisfaction'),
    ('binge_session_count', -0.0678, 'Deep engagement pattern'),
]

print("L1-Selected Features (C=0.01, 14 features):")
print("=" * 80)
print(f"  {'Feature':30s}  {'Coef':>8s}  Interpretation")
print("-" * 80)
for feat, coef, interp in selected_features:
    direction = "+" if coef > 0 else "-"
    print(f"  {feat:30s}  {coef:+8.4f}  {direction} churn: {interp}")
L1-Selected Features (C=0.01, 14 features):
================================================================================
  Feature                          Coef  Interpretation
--------------------------------------------------------------------------------
  genre_diversity                 -0.4812  - churn: Content engagement breadth
  support_tickets_90d             +0.3567  + churn: Customer frustration signal
  days_since_last_login           +0.3234  + churn: Recency of engagement
  hours_7d_missing                +0.2987  + churn: Missing usage = disengagement signal
  email_open_rate                 -0.2456  - churn: Communication responsiveness
  engagement_trend_7d_30d         +0.2178  + churn: Deteriorating engagement
  sessions_last_7d                -0.1923  - churn: Recent activity volume
  referred_a_friend               -0.1756  - churn: Social commitment signal
  nps_missing                     +0.1534  + churn: Survey non-response = disengagement
  plan_is_premium                 -0.1345  - churn: Higher plan = higher switching cost
  tenure_months                   -0.1123  - churn: Longer tenure = more habitual
  monthly_charge                  +0.0945  + churn: Price sensitivity signal
  completion_rate                 -0.0834  - churn: Content satisfaction
  binge_session_count             -0.0678  - churn: Deep engagement pattern

Every selected feature has a clear business interpretation. The missing indicators (hours_7d_missing, nps_missing) survived selection --- confirming the Chapter 8 finding that missingness carries churn signal. The interaction terms and polynomial features were all eliminated; they added complexity without adding enough unique signal to survive L1 penalization.


Phase 3: Validation with Tree-Based Importance

The team validates the L1 selection using a different method family: gradient boosted tree importance.

# GBT trained on all 108 post-filter features; permutation importance
# then measured on the holdout set (see the sketch below for the
# mechanics)

# Permutation importance results (top 20)
perm_results = [
    ('genre_diversity', 0.0487, 0.0032),
    ('support_tickets_90d', 0.0412, 0.0028),
    ('days_since_last_login', 0.0378, 0.0031),
    ('hours_7d_missing', 0.0334, 0.0027),
    ('engagement_trend_7d_30d', 0.0289, 0.0024),
    ('email_open_rate', 0.0267, 0.0026),
    ('sessions_last_7d', 0.0234, 0.0021),
    ('referred_a_friend', 0.0198, 0.0019),
    ('nps_missing', 0.0178, 0.0022),
    ('plan_is_premium', 0.0156, 0.0018),
    ('tenure_months', 0.0134, 0.0016),
    ('binge_session_count', 0.0112, 0.0015),
    ('monthly_charge', 0.0098, 0.0014),
    ('completion_rate', 0.0087, 0.0013),
    ('used_mobile_app', 0.0045, 0.0012),
    ('device_count', 0.0034, 0.0011),
    ('signup_month_cos', 0.0023, 0.0010),
    ('shared_content_count', 0.0018, 0.0009),
    ('discount_count', 0.0012, 0.0008),
    ('genre_count', 0.0008, 0.0009),
]

print("Permutation Importance (top 20, holdout set):")
print("=" * 70)
print(f"  {'Rank':>4s}  {'Feature':30s}  {'Mean Drop':>10s}  {'Std':>8s}  L1-Selected?")
print("-" * 70)
for i, (feat, mean, std) in enumerate(perm_results, 1):
    l1_selected = "YES" if feat in [f[0] for f in selected_features] else "no"
    print(f"  {i:4d}  {feat:30s}  {mean:10.4f}  {std:8.4f}  {l1_selected}")
Permutation Importance (top 20, holdout set):
======================================================================
  Rank  Feature                         Mean Drop       Std  L1-Selected?
----------------------------------------------------------------------
     1  genre_diversity                    0.0487    0.0032  YES
     2  support_tickets_90d                0.0412    0.0028  YES
     3  days_since_last_login              0.0378    0.0031  YES
     4  hours_7d_missing                   0.0334    0.0027  YES
     5  engagement_trend_7d_30d            0.0289    0.0024  YES
     6  email_open_rate                    0.0267    0.0026  YES
     7  sessions_last_7d                   0.0234    0.0021  YES
     8  referred_a_friend                  0.0198    0.0019  YES
     9  nps_missing                        0.0178    0.0022  YES
    10  plan_is_premium                    0.0156    0.0018  YES
    11  tenure_months                      0.0134    0.0016  YES
    12  binge_session_count                0.0112    0.0015  YES
    13  monthly_charge                     0.0098    0.0014  YES
    14  completion_rate                    0.0087    0.0013  YES
    15  used_mobile_app                    0.0045    0.0012  no
    16  device_count                       0.0034    0.0011  no
    17  signup_month_cos                   0.0023    0.0010  no
    18  shared_content_count               0.0018    0.0009  no
    19  discount_count                     0.0012    0.0008  no
    20  genre_count                        0.0008    0.0009  no

The two methods agree almost perfectly: the top 14 features by permutation importance are the same 14 features selected by L1 regularization. This convergence across method families is strong evidence that the selection is robust and not an artifact of a single method.
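
A sketch of the cross-family check, reusing the same synthetic stand-in as the L1 sweep sketch (features are bare indices here; the production run scores the real holdout set):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Same synthetic stand-in as in the L1 sweep sketch
X, y = make_classification(n_samples=5_000, n_features=108,
                           n_informative=14, random_state=42)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

gbt = GradientBoostingClassifier(n_estimators=200, random_state=42)
gbt.fit(X_train, y_train)

# Shuffle each column of the holdout set and measure the AUC drop
result = permutation_importance(gbt, X_hold, y_hold, scoring='roc_auc',
                                n_repeats=10, random_state=42)

for rank, i in enumerate(result.importances_mean.argsort()[::-1][:14], 1):
    print(f"{rank:4d}  feature_{i:<3d}  mean drop = {result.importances_mean[i]:.4f}"
          f"  std = {result.importances_std[i]:.4f}")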


Phase 4: Final Comparison

# Five-fold CV comparison of the candidate feature sets; each row is
# (configuration, n_features, mean AUC, AUC std, train time in minutes).
# See the sketch after the output for how the loop is run.
comparison = [
    ('All 127 features', 127, 0.836, 0.0071, 47.2),
    ('After filter (108)', 108, 0.841, 0.0065, 38.6),
    ('L1-selected (14)', 14, 0.851, 0.0048, 3.2),
    ('Top 10 by perm. imp.', 10, 0.847, 0.0052, 2.1),
    ('Top 6 by perm. imp.', 6, 0.823, 0.0058, 1.4),
]

print("StreamFlow Churn Model: Feature Selection Comparison")
print("=" * 80)
print(f"  {'Configuration':25s}  {'Features':>8s}  {'AUC':>6s}  {'Std':>6s}  {'Train Time':>12s}")
print("-" * 80)
for name, n_feat, auc, std, train_time in comparison:
    print(f"  {name:25s}  {n_feat:8d}  {auc:.3f}  {std:.4f}  {train_time:9.1f} min")
StreamFlow Churn Model: Feature Selection Comparison
================================================================================
  Configuration              Features     AUC     Std    Train Time
--------------------------------------------------------------------------------
  All 127 features                127   0.836  0.0071         47.2 min
  After filter (108)              108   0.841  0.0065         38.6 min
  L1-selected (14)                 14   0.851  0.0048          3.2 min
  Top 10 by perm. imp.             10   0.847  0.0052          2.1 min
  Top 6 by perm. imp.               6   0.823  0.0058          1.4 min
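
A sketch of how the comparison loop can be run. Everything here is a synthetic stand-in: the data come from make_classification, and the 'First 14 features' set is a placeholder for the actual L1-selected columns:

import time
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real matrix and the candidate feature sets
X_arr, y = make_classification(n_samples=5_000, n_features=108,
                               n_informative=14, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(108)])
feature_sets = {
    'All 108 features': list(X.columns),
    'First 14 features': list(X.columns[:14]),  # placeholder for the L1 set
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, cols in feature_sets.items():
    model = GradientBoostingClassifier(n_estimators=200, random_state=42)
    start = time.time()
    scores = cross_val_score(model, X[cols], y, cv=cv, scoring='roc_auc')
    minutes = (time.time() - start) / 60
    print(f"{name:25s}  {len(cols):3d} features  AUC = {scores.mean():.3f}  "
          f"std = {scores.std():.4f}  {minutes:.1f} min")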

The Decision

The team selects the 14-feature L1 model. The results:

  • AUC improved by 1.5 points (0.836 to 0.851) --- removing noise helped the model
  • Training time dropped 93% (47 minutes to 3.2 minutes)
  • Feature count dropped 89% (127 to 14)
  • Cross-validation standard deviation dropped 32% (0.0071 to 0.0048), making the model more stable
  • Every feature has a business interpretation --- the retention team can understand and act on the predictions
  • Monitoring burden dropped dramatically --- 14 features to track instead of 127

The ML engineer who raised the monitoring concern is satisfied: "Fourteen features I can monitor. Fourteen dependencies I can maintain. This is a model I can actually deploy."


Lessons

  1. More features does not mean a better model. The 14-feature model beat the 127-feature model on every metric: accuracy, stability, training speed, interpretability, and deployability. The additional 113 features were not just unhelpful; they were actively harmful.

  2. Use method families in sequence. Filter methods (5 minutes of compute) eliminated 19 features. Embedded methods (30 minutes of compute) eliminated 94 more. Each method is appropriate at its scale.

  3. Validate across method families. The L1 selection and permutation importance agreed on the top 14 features. If they had disagreed, that would have signaled instability in the selection and required further investigation.

  4. Missing indicators survived selection. Two of the 14 selected features are missing indicators from Chapter 8. Feature selection validated the Chapter 8 insight: missingness carries signal.

  5. Feature selection is a deployment enabler. The data science team cares about AUC. The engineering team cares about operational complexity. Feature selection improved both simultaneously.


This case study demonstrates the complete feature selection workflow for a production churn model. Return to the chapter for the underlying methods.