Case Study 1: StreamFlow --- When 127 Features Became 14
Background
StreamFlow's data science team has been building toward this moment for three months. The SQL extraction pipeline (Chapter 5) pulls subscriber behavior from the data warehouse. The feature engineering pipeline (Chapter 6) transforms raw events into 25 base features. Categorical encoding (Chapter 7) adds 8 more. Missing data handling (Chapter 8) adds 6 missing indicators and fills gaps. Interaction terms and polynomial features push the count higher.
The total: 127 features for churn prediction.
The model --- a gradient boosted tree with 200 estimators --- trains in 47 minutes on the full 2.4 million subscriber-month dataset. It achieves an AUC of 0.836 on the holdout set. The retention team is satisfied. The engineering team is not.
"How are we supposed to monitor 127 features in production?" asks the ML engineer responsible for deployment. "Every feature is a dependency. Every dependency can break. Every break requires a page, an investigation, and a fix. With 127 features, something will break every week."
She is right. Feature selection is not just a modeling improvement --- it is a deployment requirement.
The Data
The 127 features break down into categories:
| Category | Count | Examples |
|---|---|---|
| Usage metrics (7d, 30d, 90d windows) | 18 | hours_last_7d, sessions_last_30d, avg_session_duration_90d |
| Engagement ratios | 12 | hours_per_tenure_month, sessions_per_device, genre_diversity |
| Temporal trends | 8 | engagement_trend_7d_30d, hours_trend_30d_90d |
| Account features | 6 | tenure_months, monthly_charge, plan_type_encoded |
| Behavioral indicators | 9 | used_mobile_app, used_search, used_recommendations |
| Communication features | 5 | email_open_rate, email_click_rate, nps_score |
| Missing indicators | 6 | hours_7d_missing, email_open_rate_missing, nps_missing |
| Interaction terms | 15 | tenure_x_charge, sessions_x_genre_diversity |
| Polynomial features | 12 | hours_7d_squared, tenure_squared |
| Categorical encodings | 8 | plan_type_basic, plan_type_premium, device_os_ios |
| Social features | 4 | referred_a_friend, shared_content_count |
| Support features | 3 | support_tickets_90d, avg_resolution_time, escalation_flag |
| Seasonal features | 6 | signup_month_sin, signup_month_cos, is_holiday_signup |
| Customer lifetime value proxies | 5 | total_revenue, avg_monthly_spend, discount_count |
| Content preference features | 10 | top_genre_drama, genre_count, binge_session_count |
| **Total** | **127** | |
Phase 1: Filter Screening
The team starts with the fastest methods to eliminate obvious noise and redundancy.
Variance Threshold
```python
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Load the full feature matrix (127 features, 2.4M rows)
# Using a 50K sample for initial screening
np.random.seed(42)
n = 50000

# Simulated feature-variance summary
# (In production, this is loaded from the feature store)
variance_results = {
    'is_holiday_signup': 0.0023,   # 99.8% are 0
    'signup_month_cos': 0.2514,
    'signup_month_sin': 0.2487,
    'plan_type_family': 0.0098,    # 99% are 0
    'device_os_linux': 0.0034,     # 99.7% are 0
    'escalation_flag': 0.0087,     # 99.1% are 0
}

selector = VarianceThreshold(threshold=0.01)
# selector.fit(X)

print("Features removed by VarianceThreshold (var < 0.01):")
print("=" * 55)
for feat, var in sorted(variance_results.items(), key=lambda x: x[1]):
    if var < 0.01:
        print(f"  {feat:30s} var = {var:.4f}")
print(f"\nRemoved: 4 features")
print(f"Remaining: 123 features")
```

```
Features removed by VarianceThreshold (var < 0.01):
=======================================================
  is_holiday_signup              var = 0.0023
  device_os_linux                var = 0.0034
  escalation_flag                var = 0.0087
  plan_type_family               var = 0.0098

Removed: 4 features
Remaining: 123 features
```
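The same screen can be run end to end on a real feature matrix rather than a precomputed variance summary. A minimal, self-contained sketch on synthetic data (column names are borrowed from the feature table for illustration; the thresholding mechanics are the point):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(42)
n = 1000

# Two informative columns plus one near-constant flag (~99.8% zeros)
X = pd.DataFrame({
    "hours_last_7d": rng.exponential(5.0, n),
    "tenure_months": rng.integers(1, 60, n).astype(float),
    "is_holiday_signup": (rng.random(n) < 0.002).astype(float),
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(X)

# get_support() returns a boolean mask over the input columns
kept = X.columns[selector.get_support()].tolist()
dropped = X.columns[~selector.get_support()].tolist()
print("kept:   ", kept)
print("dropped:", dropped)
```

Note that `VarianceThreshold` operates on raw variances, so for a binary flag the 0.01 threshold corresponds to roughly 99% of rows sharing one value.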
Correlation-Based Redundancy Removal
```python
import pandas as pd
import numpy as np

# Compute correlation matrix on the 123 remaining features
# Identify pairs with |r| > 0.85
high_corr_pairs = [
    ('hours_last_7d', 'sessions_last_7d', 0.94),
    ('hours_last_7d', 'hours_last_30d', 0.92),
    ('hours_last_30d', 'hours_last_90d', 0.96),
    ('sessions_last_7d', 'sessions_last_30d', 0.91),
    ('sessions_last_30d', 'sessions_last_90d', 0.95),
    ('hours_last_7d', 'avg_session_duration_7d', 0.88),
    ('hours_7d_squared', 'hours_last_7d', 0.97),
    ('tenure_squared', 'tenure_months', 0.95),
    ('email_open_rate', 'email_click_rate', 0.87),
    ('engagement_trend_7d_30d', 'hours_trend_7d_30d', 0.93),
    ('engagement_trend_30d_90d', 'hours_trend_30d_90d', 0.91),
    ('total_revenue', 'tenure_months', 0.89),
    ('total_revenue', 'avg_monthly_spend', 0.86),
    ('monthly_charge', 'avg_monthly_spend', 0.92),
]

print(f"Highly correlated pairs (|r| > 0.85): {len(high_corr_pairs)}")
print("=" * 70)
for f1, f2, corr in sorted(high_corr_pairs, key=lambda x: x[2], reverse=True):
    print(f"  {f1:35s} <-> {f2:35s} |r| = {corr:.2f}")

# Decision: from each correlated cluster, keep the feature with the highest
# univariate correlation with the churn target
features_to_drop_corr = [
    'hours_last_30d',            # keep hours_last_7d (more recent)
    'hours_last_90d',            # redundant with 30d and 7d
    'sessions_last_30d',         # keep sessions_last_7d
    'sessions_last_90d',         # redundant
    'avg_session_duration_7d',   # derivable from hours and sessions
    'hours_7d_squared',          # polynomial of a kept feature
    'tenure_squared',            # polynomial of a kept feature
    'email_click_rate',          # keep email_open_rate (more complete)
    'hours_trend_7d_30d',        # keep engagement_trend
    'hours_trend_30d_90d',       # redundant with engagement trend
    'avg_monthly_spend',         # keep monthly_charge (more direct)
    'total_revenue',             # highly correlated with tenure x charge
]

print(f"\nFeatures removed for redundancy: {len(features_to_drop_corr)}")
print(f"Remaining: {123 - len(features_to_drop_corr)} features")
```

```
Highly correlated pairs (|r| > 0.85): 14
======================================================================
  hours_7d_squared                    <-> hours_last_7d            |r| = 0.97
  hours_last_30d                      <-> hours_last_90d           |r| = 0.96
  sessions_last_30d                   <-> sessions_last_90d        |r| = 0.95
  tenure_squared                      <-> tenure_months            |r| = 0.95
  hours_last_7d                       <-> sessions_last_7d         |r| = 0.94
  engagement_trend_7d_30d             <-> hours_trend_7d_30d       |r| = 0.93
  hours_last_7d                       <-> hours_last_30d           |r| = 0.92
  monthly_charge                      <-> avg_monthly_spend        |r| = 0.92
  sessions_last_7d                    <-> sessions_last_30d        |r| = 0.91
  engagement_trend_30d_90d            <-> hours_trend_30d_90d      |r| = 0.91
  total_revenue                       <-> tenure_months            |r| = 0.89
  hours_last_7d                       <-> avg_session_duration_7d  |r| = 0.88
  email_open_rate                     <-> email_click_rate         |r| = 0.87
  total_revenue                       <-> avg_monthly_spend        |r| = 0.86

Features removed for redundancy: 12
Remaining: 111 features
```
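The pair list above was curated by hand for the write-up; in practice the flagging step is mechanical. A sketch of how such pairs can be found programmatically, using synthetic columns as a stand-in for the real feature store (the constructed correlations are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

hours_7d = rng.exponential(5.0, n)
df = pd.DataFrame({
    "hours_last_7d": hours_7d,
    "hours_7d_squared": hours_7d ** 2,                          # deterministic transform
    "sessions_last_7d": 2 * hours_7d + rng.normal(0, 1.0, n),   # noisy proxy
    "email_open_rate": rng.random(n),                           # unrelated
})

# Keep only the upper triangle so each pair is reported once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = [
    (a, b, upper.loc[a, b])
    for a in upper.index for b in upper.columns
    if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > 0.85
]
for a, b, r in sorted(pairs, key=lambda t: -t[2]):
    print(f"{a} <-> {b}  |r| = {r:.2f}")
```

The upper-triangle mask is the standard trick for avoiding duplicate `(a, b)` / `(b, a)` reports and self-correlations.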
VIF Analysis
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Compute VIF on the remaining 111 features
# (In practice, computed on a sample for speed)

# Summary of VIF results after correlation-based removal
vif_flagged = [
    ('sessions_last_7d', 8.7),
    ('avg_session_duration_30d', 7.2),
    ('avg_session_duration_90d', 6.9),
    ('engagement_trend_30d_90d', 5.4),
    ('signup_month_sin', 5.1),
    ('signup_month_cos', 5.3),
]

print("Features with VIF > 5 (after correlation-based removal):")
print("=" * 55)
for feat, vif in sorted(vif_flagged, key=lambda x: x[1], reverse=True):
    label = "severe" if vif > 10 else "elevated"
    print(f"  {feat:35s} VIF = {vif:.1f}  {label}")

# sessions_last_7d is kept despite its elevated VIF: it is a strong
# univariate churn predictor with no surviving substitute
features_to_drop_vif = [
    'avg_session_duration_30d',  # derivable from kept hours/sessions features
    'avg_session_duration_90d',  # likewise derivable
    'signup_month_sin',          # low predictive value, collinear with cos
]

print(f"\nFeatures removed by VIF: {len(features_to_drop_vif)}")
print(f"Remaining: {111 - len(features_to_drop_vif)} features")
```

```
Features with VIF > 5 (after correlation-based removal):
=======================================================
  sessions_last_7d                    VIF = 8.7  elevated
  avg_session_duration_30d            VIF = 7.2  elevated
  avg_session_duration_90d            VIF = 6.9  elevated
  engagement_trend_30d_90d            VIF = 5.4  elevated
  signup_month_cos                    VIF = 5.3  elevated
  signup_month_sin                    VIF = 5.1  elevated

Features removed by VIF: 3
Remaining: 108 features
```
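For readers who want to reproduce the VIF computation itself, here is a self-contained sketch on synthetic data; the deliberately collinear column is an assumption for illustration, not StreamFlow's actual data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 2000

x1 = rng.normal(0, 1, n)
X = pd.DataFrame({
    "sessions_last_7d": x1,
    # deliberately collinear: mostly a rescaled copy of sessions_last_7d
    "avg_session_duration_30d": 0.9 * x1 + 0.3 * rng.normal(0, 1, n),
    "tenure_months": rng.normal(0, 1, n),  # independent
})

# VIF is computed column-by-column against the standardized design matrix
Xs = StandardScaler().fit_transform(X)
vifs = {col: variance_inflation_factor(Xs, i) for i, col in enumerate(X.columns)}
for col, v in sorted(vifs.items(), key=lambda kv: -kv[1]):
    print(f"{col:26s} VIF = {v:.1f}")
```

With this construction the collinear pair lands near VIF = 10 (R-squared of about 0.9 against the other columns), while the independent column stays near 1.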
Filter screening summary: 127 features reduced to 108 in under 5 minutes of compute time. Nineteen features removed --- 4 for near-zero variance, 12 for redundancy, 3 for multicollinearity.
Phase 2: Embedded Selection
With 108 features remaining, the team applies L1 regularization to identify the features that survive penalized fitting.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
import numpy as np

# L1 logistic regression with cross-validated C selection
# Testing C = 0.001, 0.01, 0.1, 1.0
results_by_c = {
    0.001: {'nonzero': 6, 'auc': 0.798},
    0.01: {'nonzero': 14, 'auc': 0.827},
    0.1: {'nonzero': 31, 'auc': 0.839},
    1.0: {'nonzero': 52, 'auc': 0.835},
}

print("L1 Feature Selection: Regularization Sweep")
print("=" * 60)
print(f"  {'C':>8s} {'Non-zero':>10s} {'CV AUC':>8s}")
print("-" * 60)
for c, res in results_by_c.items():
    print(f"  {c:8.3f} {res['nonzero']:10d} {res['auc']:8.3f}")

print(f"\nBest AUC at C=0.1 with 31 features.")
print(f"But C=0.01 with 14 features is within 1.2 AUC points.")
```

```
L1 Feature Selection: Regularization Sweep
============================================================
         C   Non-zero   CV AUC
------------------------------------------------------------
     0.001          6    0.798
     0.010         14    0.827
     0.100         31    0.839
     1.000         52    0.835

Best AUC at C=0.1 with 31 features.
But C=0.01 with 14 features is within 1.2 AUC points.
```
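The sweep itself can be reproduced with a scaler-plus-L1 pipeline. This sketch uses `make_classification` as a stand-in for the StreamFlow matrix, so the exact counts and AUCs will differ from the table above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 108-feature matrix: 40 features, 10 informative
X, y = make_classification(n_samples=4000, n_features=40, n_informative=10,
                           n_redundant=10, random_state=42)

results = {}
for C in (0.001, 0.01, 0.1, 1.0):
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=C),
    )
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    clf.fit(X, y)  # refit on everything to count surviving coefficients
    nonzero = int((clf.named_steps["logisticregression"].coef_ != 0).sum())
    results[C] = (nonzero, auc)
    print(f"C={C:6.3f}  nonzero={nonzero:3d}  CV AUC={auc:.3f}")
```

Scaling before the L1 fit matters: without it, the penalty shrinks coefficients of large-scale features less aggressively than small-scale ones, biasing the selection.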
The team examines the 14 features selected at C=0.01:
```python
# Features selected by L1 at C=0.01 (14 features)
selected_features = [
    ('genre_diversity', -0.4812, 'Content engagement breadth'),
    ('support_tickets_90d', 0.3567, 'Customer frustration signal'),
    ('days_since_last_login', 0.3234, 'Recency of engagement'),
    ('hours_7d_missing', 0.2987, 'Missing usage = disengagement signal'),
    ('email_open_rate', -0.2456, 'Communication responsiveness'),
    ('engagement_trend_7d_30d', 0.2178, 'Deteriorating engagement'),
    ('sessions_last_7d', -0.1923, 'Recent activity volume'),
    ('referred_a_friend', -0.1756, 'Social commitment signal'),
    ('nps_missing', 0.1534, 'Survey non-response = disengagement'),
    ('plan_is_premium', -0.1345, 'Higher plan = higher switching cost'),
    ('tenure_months', -0.1123, 'Longer tenure = more habitual'),
    ('monthly_charge', 0.0945, 'Price sensitivity signal'),
    ('completion_rate', -0.0834, 'Content satisfaction'),
    ('binge_session_count', -0.0678, 'Deep engagement pattern'),
]

print("L1-Selected Features (C=0.01, 14 features):")
print("=" * 80)
print(f"  {'Feature':30s} {'Coef':>8s}  Interpretation")
print("-" * 80)
for feat, coef, interp in selected_features:
    direction = "+" if coef > 0 else "-"
    print(f"  {feat:30s} {coef:+8.4f}  {direction} churn: {interp}")
```

```
L1-Selected Features (C=0.01, 14 features):
================================================================================
  Feature                            Coef  Interpretation
--------------------------------------------------------------------------------
  genre_diversity                 -0.4812  - churn: Content engagement breadth
  support_tickets_90d             +0.3567  + churn: Customer frustration signal
  days_since_last_login           +0.3234  + churn: Recency of engagement
  hours_7d_missing                +0.2987  + churn: Missing usage = disengagement signal
  email_open_rate                 -0.2456  - churn: Communication responsiveness
  engagement_trend_7d_30d         +0.2178  + churn: Deteriorating engagement
  sessions_last_7d                -0.1923  - churn: Recent activity volume
  referred_a_friend               -0.1756  - churn: Social commitment signal
  nps_missing                     +0.1534  + churn: Survey non-response = disengagement
  plan_is_premium                 -0.1345  - churn: Higher plan = higher switching cost
  tenure_months                   -0.1123  - churn: Longer tenure = more habitual
  monthly_charge                  +0.0945  + churn: Price sensitivity signal
  completion_rate                 -0.0834  - churn: Content satisfaction
  binge_session_count             -0.0678  - churn: Deep engagement pattern
```
Every selected feature has a clear business interpretation. The missing indicators (hours_7d_missing, nps_missing) survived selection --- confirming the Chapter 8 finding that missingness carries churn signal. The interaction terms and polynomial features were all eliminated; they added complexity without adding enough unique signal to survive L1 penalization.
Phase 3: Validation with Tree-Based Importance
The team validates the L1 selection using a different method family: gradient boosted tree importance.
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Train GBT on all 108 post-filter features,
# then check permutation importance on the holdout set

# Permutation importance results (top 20)
perm_results = [
    ('genre_diversity', 0.0487, 0.0032),
    ('support_tickets_90d', 0.0412, 0.0028),
    ('days_since_last_login', 0.0378, 0.0031),
    ('hours_7d_missing', 0.0334, 0.0027),
    ('engagement_trend_7d_30d', 0.0289, 0.0024),
    ('email_open_rate', 0.0267, 0.0026),
    ('sessions_last_7d', 0.0234, 0.0021),
    ('referred_a_friend', 0.0198, 0.0019),
    ('nps_missing', 0.0178, 0.0022),
    ('plan_is_premium', 0.0156, 0.0018),
    ('tenure_months', 0.0134, 0.0016),
    ('binge_session_count', 0.0112, 0.0015),
    ('monthly_charge', 0.0098, 0.0014),
    ('completion_rate', 0.0087, 0.0013),
    ('used_mobile_app', 0.0045, 0.0012),
    ('device_count', 0.0034, 0.0011),
    ('signup_month_cos', 0.0023, 0.0010),
    ('shared_content_count', 0.0018, 0.0009),
    ('discount_count', 0.0012, 0.0008),
    ('genre_count', 0.0008, 0.0009),
]

print("Permutation Importance (top 20, holdout set):")
print("=" * 70)
print(f"  {'Rank':>4s} {'Feature':30s} {'Mean Drop':>10s} {'Std':>8s}  L1-Selected?")
print("-" * 70)
for i, (feat, mean, std) in enumerate(perm_results, 1):
    l1_selected = "YES" if feat in [f[0] for f in selected_features] else "no"
    print(f"  {i:4d} {feat:30s} {mean:10.4f} {std:8.4f}  {l1_selected}")
```

```
Permutation Importance (top 20, holdout set):
======================================================================
  Rank Feature                         Mean Drop      Std  L1-Selected?
----------------------------------------------------------------------
     1 genre_diversity                    0.0487   0.0032  YES
     2 support_tickets_90d                0.0412   0.0028  YES
     3 days_since_last_login              0.0378   0.0031  YES
     4 hours_7d_missing                   0.0334   0.0027  YES
     5 engagement_trend_7d_30d            0.0289   0.0024  YES
     6 email_open_rate                    0.0267   0.0026  YES
     7 sessions_last_7d                   0.0234   0.0021  YES
     8 referred_a_friend                  0.0198   0.0019  YES
     9 nps_missing                        0.0178   0.0022  YES
    10 plan_is_premium                    0.0156   0.0018  YES
    11 tenure_months                      0.0134   0.0016  YES
    12 binge_session_count                0.0112   0.0015  YES
    13 monthly_charge                     0.0098   0.0014  YES
    14 completion_rate                    0.0087   0.0013  YES
    15 used_mobile_app                    0.0045   0.0012  no
    16 device_count                       0.0034   0.0011  no
    17 signup_month_cos                   0.0023   0.0010  no
    18 shared_content_count               0.0018   0.0009  no
    19 discount_count                     0.0012   0.0008  no
    20 genre_count                        0.0008   0.0009  no
```
The two methods agree almost perfectly: the top 14 features by permutation importance are exactly the 14 features selected by L1 regularization, with only minor differences in ordering near the bottom of the list. This convergence across method families is strong evidence that the selection is robust and not an artifact of a single method.
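The permutation-importance step itself can be sketched on synthetic data; feature names here are just indices, and the holdout-set scoring is the part that matters (importances computed on training data would overstate memorized features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=12, n_informative=4,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0, stratify=y)

gbt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Importance = mean AUC drop on the holdout set when one column is shuffled
result = permutation_importance(gbt, X_val, y_val, n_repeats=10,
                                scoring="roc_auc", random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]
for i in ranked[:5]:
    print(f"feature_{i:02d}  mean drop = {result.importances_mean[i]:.4f}"
          f"  (std {result.importances_std[i]:.4f})")
```

Because the importance is model-agnostic (it only needs `predict_proba` via the scorer), the same call validates any fitted classifier, which is what makes it a useful cross-check on L1 selection.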
Phase 4: Final Comparison
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import time

# Results from the team's final comparison
comparison = [
    ('All 127 features', 127, 0.836, 0.0071, 47.2),
    ('After filter (108)', 108, 0.841, 0.0065, 38.6),
    ('L1-selected (14)', 14, 0.851, 0.0048, 3.2),
    ('Top 10 by perm. imp.', 10, 0.847, 0.0052, 2.1),
    ('Top 6 by perm. imp.', 6, 0.823, 0.0058, 1.4),
]

print("StreamFlow Churn Model: Feature Selection Comparison")
print("=" * 80)
print(f"  {'Configuration':25s} {'Features':>8s} {'AUC':>6s} {'Std':>6s} {'Train Time':>12s}")
print("-" * 80)
for name, n_feat, auc, std, train_time in comparison:
    print(f"  {name:25s} {n_feat:8d}  {auc:.3f} {std:.4f} {train_time:9.1f} min")
```

```
StreamFlow Churn Model: Feature Selection Comparison
================================================================================
  Configuration             Features    AUC    Std   Train Time
--------------------------------------------------------------------------------
  All 127 features               127  0.836 0.0071      47.2 min
  After filter (108)             108  0.841 0.0065      38.6 min
  L1-selected (14)                14  0.851 0.0048       3.2 min
  Top 10 by perm. imp.            10  0.847 0.0052       2.1 min
  Top 6 by perm. imp.              6  0.823 0.0058       1.4 min
```
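The comparison protocol, same cross-validation splits, different feature subsets, can be sketched on synthetic data. Here `shuffle=False` in `make_classification` is an assumption of the sketch: it places the informative columns first so that slicing mimics "selected features":

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# shuffle=False keeps the 6 informative columns first, so X[:, :6]
# plays the role of the selected subset
X, y = make_classification(n_samples=2000, n_features=30, n_informative=6,
                           n_redundant=0, shuffle=False, random_state=1)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores_by_name = {}
for name, Xs in {"all 30 features": X, "selected 6 features": X[:, :6]}.items():
    scores = cross_val_score(GradientBoostingClassifier(random_state=1),
                             Xs, y, cv=cv, scoring="roc_auc")
    scores_by_name[name] = scores
    print(f"{name:20s} AUC = {scores.mean():.3f} +/- {scores.std():.4f}")
```

Fixing the `StratifiedKFold` object (and its `random_state`) across configurations is what makes the AUC and standard-deviation columns directly comparable, as in the team's table.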
The Decision
The team selects the 14-feature L1 model. The results:
- AUC improved by 1.5 points (0.836 to 0.851) --- removing noise helped the model
- Training time dropped 93% (47 minutes to 3.2 minutes)
- Feature count dropped 89% (127 to 14)
- Cross-validation variance dropped 32% --- the model is more stable
- Every feature has a business interpretation --- the retention team can understand and act on the predictions
- Monitoring burden dropped dramatically --- 14 features to track instead of 127
The ML engineer who raised the monitoring concern is satisfied: "Fourteen features I can monitor. Fourteen dependencies I can maintain. This is a model I can actually deploy."
Lessons
- More features are not better. The 14-feature model beat the 127-feature model on every metric: accuracy, stability, training speed, interpretability, and deployability. The 113 discarded features were not just unhelpful --- they were actively harmful.
- Use method families in sequence. Filter methods (5 minutes of compute) eliminated 19 features. Embedded methods (30 minutes of compute) eliminated 94 more. Each method is appropriate at its scale.
- Validate across method families. The L1 selection and permutation importance agreed on the top 14 features. If they had disagreed, that would have signaled instability in the selection and required further investigation.
- Missing indicators survived selection. Two of the 14 selected features are missing indicators from Chapter 8. Feature selection validated the Chapter 8 insight: missingness carries signal.
- Feature selection is a deployment enabler. The data science team cares about AUC; the engineering team cares about operational complexity. Feature selection improved both simultaneously.
This case study demonstrates the complete feature selection workflow for a production churn model. Return to the chapter for the underlying methods.