Case Study 1: StreamFlow Feature Engineering Workshop
Background
StreamFlow's churn rate has been hovering at 8.2% monthly for three quarters. The data science team has a clean dataset (extracted in Chapter 5's SQL milestone), a well-framed binary classification problem (Chapter 1), and a clear business mandate: identify subscribers likely to churn within 30 days so the retention team can intervene.
The team's first model --- a gradient boosted tree trained on five raw features (total_hours_watched, support_ticket_count, num_devices, plan_price, account_age_days) --- achieved an AUC of 0.71. The VP of Product called it "interesting but not actionable." The retention team said they could not prioritize outreach based on a model that barely outperformed their existing rule (flag anyone who has not logged in for 14 days).
The team's challenge: get the AUC above 0.80 without changing the algorithm. The constraint is deliberate --- it forces the improvement to come from features, not from hyperparameter tuning or architecture search.
The Data
The subscriber dataset contains 2.4 million active subscribers as of January 31, 2025. The events table contains timestamped records of every login, watch session, support ticket, plan change, and billing event.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
# Load data
subscribers = pd.read_csv('streamflow_subscribers.csv',
                          parse_dates=['signup_date', 'last_login_date',
                                       'last_ticket_date', 'last_plan_change_date'])
events = pd.read_csv('streamflow_events.csv', parse_dates=['event_date'])
prediction_date = pd.Timestamp('2025-01-31')
target = subscribers['churned_within_30_days']
print(f"Subscribers: {len(subscribers):,}")
print(f"Events: {len(events):,}")
print(f"Churn rate: {target.mean():.3f}")
print(f"Churners: {target.sum():,}")
Subscribers: 2,400,000
Events: 87,432,160
Churn rate: 0.082
Churners: 196,800
Phase 1: Baseline with Raw Features
# Raw features --- minimal transformation
raw_features = subscribers[['total_hours_watched', 'support_ticket_count',
'num_devices']].copy()
raw_features['account_age_days'] = (
prediction_date - subscribers['signup_date']
).dt.days
raw_features['plan_price'] = subscribers['plan_price']
X_train, X_test, y_train, y_test = train_test_split(
raw_features, target, test_size=0.2, random_state=42, stratify=target
)
gbt_baseline = GradientBoostingClassifier(
n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42
)
gbt_baseline.fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, gbt_baseline.predict_proba(X_test)[:, 1])
print(f"Baseline AUC: {baseline_auc:.4f}")
Baseline AUC: 0.7108
Phase 2: The Domain Knowledge Session
Before writing code, the team held a 60-minute session with three domain experts: a customer success manager, a product manager, and a billing engineer. They asked one question: "When you look at a subscriber's account, what tells you they are about to leave?"
The answers:
Customer success manager:
- "They stop logging in. The silence is deafening."
- "They file a support ticket and we do not resolve it quickly --- or they file multiple tickets in a short window."
- "Their usage drops off a cliff. Not a gradual decline --- a sudden stop."
- "They were on the annual plan and switched to monthly. That is the exit ramp."

Product manager:
- "They only use one feature of the platform. The ones who explore multiple genres and tools stick around."
- "New users who do not engage in the first two weeks are gone by month two."
- "Multi-device users are stickier. If they have it on their phone and their laptop, they are invested."

Billing engineer:
- "Failed payment retries. If the card declines and they do not update it within 48 hours, they are mentally gone."
- "Users who signed up during a promotion churn at 2x the rate once the promotional price expires."
Every one of these insights translates directly into a feature.
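One way to keep a session like this actionable is to record each insight as a planned feature before writing any pipeline code. A minimal planning sketch (the feature names mirror those engineered in Phase 3; the dict itself is a planning artifact, not part of the pipeline):

```python
# Planning artifact: map each expert insight to the feature(s) it motivates.
# Feature names match those engineered in Phase 3 below.
insight_to_features = {
    "They stop logging in": ["days_since_last_login", "active_last_7d", "active_last_14d"],
    "Usage drops off a cliff": ["hours_change_30d", "usage_declining", "usage_cliff"],
    "Multiple tickets in a short window": ["support_tickets_last_30d", "recent_unresolved_ticket"],
    "Annual-to-monthly switch is the exit ramp": ["has_downgraded", "recent_downgrade"],
    "Only use one feature of the platform": ["genres_last_90d", "genre_diversity_score"],
    "Failed payment retries": ["billing_failures", "has_recent_billing_failure"],
    "Promo signups churn when the price expires": ["signup_during_promo", "promo_expired"],
}

# Deduplicate and count the planned work.
planned = sorted({f for feats in insight_to_features.values() for f in feats})
print(f"{len(insight_to_features)} insights -> {len(planned)} planned features")
```

Writing the mapping down first makes the later leakage audit easier: every feature traces back to a stated hypothesis.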
Phase 3: Engineering Features from Domain Insights
Recency and Engagement
features = pd.DataFrame(index=subscribers.index)
# === Insight: "They stop logging in" ===
features['days_since_last_login'] = (
prediction_date - subscribers['last_login_date']
).dt.days
features['active_last_7d'] = (features['days_since_last_login'] <= 7).astype(int)
features['active_last_14d'] = (features['days_since_last_login'] <= 14).astype(int)
# === Insight: "New users who don't engage in the first two weeks" ===
features['tenure_months'] = (
(prediction_date - subscribers['signup_date']).dt.days / 30.44
).round(1)
features['is_first_90_days'] = (features['tenure_months'] < 3).astype(int)
Usage Patterns
# === Insight: "Usage drops off a cliff" ===
for window_days, label in [(7, '7d'), (30, '30d'), (90, '90d')]:
    cutoff = prediction_date - pd.Timedelta(days=window_days)
    window_events = events[
        (events['event_date'] >= cutoff) &
        (events['event_date'] < prediction_date) &
        (events['event_type'] == 'watch')
    ]
    agg = window_events.groupby('subscriber_id').agg(
        sessions=('event_date', 'nunique'),
        hours=('hours_watched', 'sum')
    )
    features[f'sessions_last_{label}'] = agg['sessions'].reindex(
        subscribers.index, fill_value=0
    )
    features[f'hours_last_{label}'] = agg['hours'].reindex(
        subscribers.index, fill_value=0
    )
# Trend: compare recent 30 days to prior 30 days
recent_30d_start = prediction_date - pd.Timedelta(days=30)
prior_30d_start = prediction_date - pd.Timedelta(days=60)
recent_hours = events[
(events['event_date'] >= recent_30d_start) &
(events['event_date'] < prediction_date) &
(events['event_type'] == 'watch')
].groupby('subscriber_id')['hours_watched'].sum()
prior_hours = events[
(events['event_date'] >= prior_30d_start) &
(events['event_date'] < recent_30d_start) &
(events['event_type'] == 'watch')
].groupby('subscriber_id')['hours_watched'].sum()
features['hours_change_30d'] = (
recent_hours.reindex(subscribers.index, fill_value=0) -
prior_hours.reindex(subscribers.index, fill_value=0)
)
features['usage_declining'] = (features['hours_change_30d'] < -2).astype(int)
features['usage_cliff'] = (features['hours_change_30d'] < -10).astype(int)
Support and Frustration Signals
# === Insight: "Multiple tickets in a short window" ===
cutoff_90d = prediction_date - pd.Timedelta(days=90)
cutoff_30d = prediction_date - pd.Timedelta(days=30)
ticket_events = events[events['event_type'] == 'support_ticket']
tickets_90d = ticket_events[
ticket_events['event_date'] >= cutoff_90d
].groupby('subscriber_id').size()
tickets_30d = ticket_events[
ticket_events['event_date'] >= cutoff_30d
].groupby('subscriber_id').size()
features['support_tickets_last_90d'] = tickets_90d.reindex(
subscribers.index, fill_value=0
)
features['support_tickets_last_30d'] = tickets_30d.reindex(
subscribers.index, fill_value=0
)
# Ratio: ticket intensity relative to usage
features['tickets_per_hour_90d'] = np.where(
features['hours_last_90d'] > 0,
features['support_tickets_last_90d'] / features['hours_last_90d'],
features['support_tickets_last_90d']
)
# === Insight: "They filed a ticket and we didn't resolve it quickly" ===
features['days_since_last_ticket'] = (
prediction_date - subscribers['last_ticket_date']
).dt.days.fillna(9999) # Never filed a ticket → large value
features['recent_unresolved_ticket'] = (
(features['days_since_last_ticket'] <= 14) &
(features['support_tickets_last_30d'] >= 2)
).astype(int)
Product Engagement Depth
# === Insight: "Only use one feature of the platform" ===
genre_events = events[
(events['event_date'] >= cutoff_90d) &
(events['event_type'] == 'watch')
]
genres_used = genre_events.groupby('subscriber_id')['genre'].nunique()
features['genres_last_90d'] = genres_used.reindex(subscribers.index, fill_value=0)
features['genre_diversity_score'] = np.where(
features['sessions_last_90d'] > 0,
features['genres_last_90d'] / features['sessions_last_90d'],
0
)
# === Insight: "Multi-device users are stickier" ===
features['device_count'] = subscribers['num_devices']
features['is_multi_device'] = (subscribers['num_devices'] > 1).astype(int)
Plan and Billing Signals
# === Insight: "Switched from annual to monthly --- that's the exit ramp" ===
features['plan_changes'] = subscribers['plan_change_count']
features['has_downgraded'] = subscribers['has_downgraded'].astype(int)
features['days_since_plan_change'] = (
prediction_date - subscribers['last_plan_change_date']
).dt.days.fillna(9999)
features['recent_downgrade'] = (
(features['has_downgraded'] == 1) &
(features['days_since_plan_change'] <= 30)
).astype(int)
# === Insight: "Failed payment retries" ===
features['billing_failures'] = subscribers['billing_failures']
features['has_recent_billing_failure'] = (
subscribers['days_since_last_billing_failure'] <= 14
).fillna(False).astype(int)
# === Insight: "Signed up during a promotion" ===
features['signup_during_promo'] = subscribers['promo_signup'].astype(int)
features['promo_expired'] = (
(features['signup_during_promo'] == 1) &
(features['tenure_months'] > subscribers['promo_duration_months'])
).astype(int)
Interaction Features
# Compound risk signals
features['new_and_inactive'] = (
(features['is_first_90_days'] == 1) &
(features['hours_last_30d'] == 0)
).astype(int)
features['declining_with_tickets'] = (
(features['usage_declining'] == 1) &
(features['support_tickets_last_30d'] >= 1)
).astype(int)
features['single_device_single_genre'] = (
(features['device_count'] == 1) &
(features['genres_last_90d'] <= 1)
).astype(int)
print(f"Total features engineered: {features.shape[1]}")
Total features engineered: 34
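Before training, it is worth a one-line NaN audit: sklearn's GradientBoostingClassifier raises on missing values, and the fillna / fill_value=0 calls above are easy to forget for a new feature. A sketch of that check on a toy frame (in the real notebook, the engineered features frame would be audited instead):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the engineered features frame.
features = pd.DataFrame({
    'days_since_last_login': [3, 40, np.nan],   # one subscriber missing a login date
    'hours_last_30d': [12.0, 0.0, 5.0],
})

# GradientBoostingClassifier rejects NaN inputs, so surface any gaps now.
nan_counts = features.isna().sum()
print(nan_counts[nan_counts > 0])
```

Any column that shows up here needs an explicit sentinel or fill strategy before Phase 4.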
Phase 4: Results
# Select numeric features only
numeric_features = features.select_dtypes(include=[np.number])
X_train_eng, X_test_eng, y_train, y_test = train_test_split(
numeric_features, target, test_size=0.2, random_state=42, stratify=target
)
# Same model, same hyperparameters --- only the features changed
gbt_engineered = GradientBoostingClassifier(
n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42
)
gbt_engineered.fit(X_train_eng, y_train)
engineered_auc = roc_auc_score(
y_test, gbt_engineered.predict_proba(X_test_eng)[:, 1]
)
print(f"Baseline AUC (5 raw features): {baseline_auc:.4f}")
print(f"Engineered AUC (34 features): {engineered_auc:.4f}")
print(f"Improvement: +{engineered_auc - baseline_auc:.4f}")
Baseline AUC (5 raw features): 0.7108
Engineered AUC (34 features): 0.8347
Improvement: +0.1239
A 12-point AUC improvement, same algorithm, same hyperparameters. The only change was the features.
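A gain measured on a single split could in principle be split luck; a bootstrap confidence interval over the test set is a cheap cross-check, though it does not appear in the team's notebook. A self-contained sketch on synthetic stand-in scores (in a real run, y_test and the two models' predicted probabilities would replace them):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic stand-ins for y_test and the two models' scores.
n = 5000
y = rng.binomial(1, 0.082, n)
scores_base = np.clip(0.3 * y + rng.normal(0.3, 0.2, n), 0, 1)   # weaker model
scores_eng = np.clip(0.5 * y + rng.normal(0.25, 0.2, n), 0, 1)   # stronger model

# Bootstrap the AUC difference by resampling test rows with replacement.
diffs = []
for _ in range(500):
    idx = rng.integers(0, n, n)
    if y[idx].min() == y[idx].max():   # skip degenerate resamples
        continue
    diffs.append(
        roc_auc_score(y[idx], scores_eng[idx]) -
        roc_auc_score(y[idx], scores_base[idx])
    )
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for AUC gain: [{lo:.4f}, {hi:.4f}]")
```

If the interval excludes zero, the improvement is unlikely to be an artifact of the particular 80/20 split.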
Feature Importance
importances = pd.Series(
gbt_engineered.feature_importances_,
index=numeric_features.columns
).sort_values(ascending=False)
print("Top 15 features by importance:")
for i, (feature, importance) in enumerate(importances.head(15).items(), 1):
    print(f"  {i:2d}. {feature:35s} {importance:.4f}")
Top 15 features by importance:
1. days_since_last_login 0.1842
2. hours_change_30d 0.0987
3. hours_last_30d 0.0834
4. sessions_last_7d 0.0721
5. tenure_months 0.0698
6. support_tickets_last_90d 0.0543
7. tickets_per_hour_90d 0.0487
8. usage_declining 0.0412
9. genres_last_90d 0.0389
10. is_first_90_days 0.0324
11. has_recent_billing_failure 0.0298
12. device_count 0.0267
13. recent_downgrade 0.0234
14. declining_with_tickets 0.0198
15. active_last_7d 0.0176
The top features directly correspond to the domain experts' insights. days_since_last_login (the customer success manager's "silence is deafening") is the most important feature by a wide margin. hours_change_30d (the "usage cliff") is second. support_tickets_last_90d is sixth.
The interaction features --- declining_with_tickets, new_and_inactive, single_device_single_genre --- appear in the top 20. They contribute less individually than the primary features but add information that the model cannot easily infer from the components alone.
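One caveat worth checking: impurity-based importances from gradient boosted trees can overstate high-cardinality continuous features. Permutation importance on the held-out set is a standard cross-check, though it is not part of the team's notebook. A sketch on synthetic data (in the real notebook, gbt_engineered, X_test_eng, and y_test would be used):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix (imbalanced target).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

model = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the AUC drop.
# Features whose shuffling barely moves the score contribute little,
# whatever the impurity-based ranking says.
result = permutation_importance(model, X_te, y_te, scoring='roc_auc',
                                n_repeats=10, random_state=42)
for i in np.argsort(result.importances_mean)[::-1][:5]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

If the two rankings broadly agree, as they typically do when the top features are genuinely strong, the impurity-based story stands.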
Phase 5: Leakage Audit
Before declaring victory, the team ran the leakage checklist:
# Per-feature AUC check
from sklearn.metrics import roc_auc_score
print("Leakage audit --- per-feature AUC:")
for col in numeric_features.columns:
    valid = numeric_features[col].notna()
    if valid.sum() > 100:
        auc = roc_auc_score(target[valid], numeric_features[col][valid])
        if auc > 0.85 or auc < 0.15:
            print(f"  WARNING: {col:35s} AUC: {auc:.4f} --- INVESTIGATE")
        elif auc > 0.70 or auc < 0.30:
            print(f"  STRONG:  {col:35s} AUC: {auc:.4f}")
Leakage audit --- per-feature AUC:
STRONG: days_since_last_login AUC: 0.7842
STRONG: active_last_7d AUC: 0.7318
STRONG: hours_last_30d AUC: 0.7124
STRONG: sessions_last_30d AUC: 0.7056
No feature exceeded the 0.85 threshold. The model is clean.
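The per-feature AUC scan catches gross leakage, but a full checklist also verifies that no feature window reaches past the prediction date --- otherwise the features describe the label period itself. A minimal sketch of that temporal check, shown on a toy events frame so it runs standalone (in the real pipeline, the full events frame and the same prediction_date would be used):

```python
import pandas as pd

prediction_date = pd.Timestamp('2025-01-31')

# Toy stand-in for the real events frame.
events = pd.DataFrame({
    'subscriber_id': [1, 1, 2],
    'event_date': pd.to_datetime(['2025-01-10', '2025-01-30', '2025-02-02']),
    'event_type': ['watch', 'watch', 'watch'],
})

# Any event on or after the prediction date must be excluded from every
# feature window before aggregation.
usable = events[events['event_date'] < prediction_date]
leaked = len(events) - len(usable)
print(f"Events dropped as post-prediction: {leaked}")
assert usable['event_date'].max() < prediction_date
```

Running this filter once, upstream of all the windowed aggregations, is safer than repeating the date condition in every groupby.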
Lessons Learned
- Domain experts are your most valuable collaborators. Every top feature in the model was suggested by someone who understood the business, not by automated feature generation. The 60-minute brainstorming session was the highest-ROI activity of the entire project.
- Recency dominates. In subscription models, "how long since the user last engaged" is almost always the strongest signal. The customer success manager called it "the silence is deafening." The data confirmed it.
- Trends matter more than snapshots. hours_change_30d (a trend) was more important than hours_last_30d (a snapshot). The direction of behavioral change predicts future behavior better than the current level.
- Interaction features add value at the margin. They are not individually powerful, but declining_with_tickets captures a compound risk that neither usage_declining nor support_tickets_last_30d fully captures alone. The model needs both the ingredients and the recipe.
- The leakage audit is not optional. A 12-point AUC improvement is exciting, but it is only real if the features are legitimate. Run the audit before you tell anyone the results.
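The trend-versus-snapshot point can be made concrete with two hypothetical subscribers who share an identical 30-day snapshot but have opposite trajectories (toy numbers, not drawn from the StreamFlow data):

```python
import pandas as pd

# Two hypothetical subscribers: same 8 hours watched in the last 30 days,
# opposite trajectories over the prior 30 days.
toy = pd.DataFrame({
    'subscriber': ['ramping_up', 'falling_off'],
    'hours_prior_30d': [2.0, 30.0],
    'hours_last_30d': [8.0, 8.0],
})
toy['hours_change_30d'] = toy['hours_last_30d'] - toy['hours_prior_30d']
toy['usage_cliff'] = (toy['hours_change_30d'] < -10).astype(int)   # same -10 threshold as Phase 3
print(toy.to_string(index=False))
```

The snapshot alone cannot distinguish them; the trend flags only the second subscriber as a cliff.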
Discussion Questions
- The billing engineer mentioned "failed payment retries" as a signal. This feature had moderate importance (rank 11). Why might billing failures be a less reliable signal than usage decline? Under what circumstances would billing failure be the primary signal?
- The team created promo_expired (the subscriber signed up during a promotion and the promotional price has ended). This feature captures a specific causal mechanism: price sensitivity. How would you test whether this feature is capturing real price sensitivity vs. just being correlated with tenure?
- The customer success manager's insight about "multiple tickets in a short window" led to recent_unresolved_ticket. This is a binary feature. What are the tradeoffs of encoding it as binary (yes/no) vs. continuous (number of tickets in the window)?
- If StreamFlow's retention team can only contact 5,000 subscribers per month, how would you use this model to select the 5,000 with the highest expected return from intervention? Note: the model predicts who will churn, not who will respond to intervention. What additional information would you need?
This case study supports Chapter 6: Feature Engineering. Return to the chapter for full context.