In This Chapter
- The Art and Science of Creating Predictive Variables
- The Five-Minute Feature That Beat Three Weeks of Neural Architecture Search
- Why Features Matter More Than Algorithms
- The Feature Engineering Recipe
- Temporal Features: Recency, Frequency, and Tenure
- RFM Features for Subscription Data
- Trend Features: Capturing Behavioral Change
- Mathematical Transformations
- Date Feature Extraction
- Ratio and Normalized Features
- Text Feature Extraction (Preview)
- Binning: Converting Continuous Features to Categories
- Target Encoding (Advanced)
- Putting It All Together: The StreamFlow Feature Matrix
- Feature Engineering Without Data Leakage
- Feature Validation: Sanity Checks Before Modeling
- The Feature Engineering Mindset
- Progressive Project Milestone M2: Engineer StreamFlow Features
- Summary
Chapter 6: Feature Engineering
The Art and Science of Creating Predictive Variables
Learning Objectives
By the end of this chapter, you will be able to:
- Generate features from raw data using domain knowledge
- Create temporal features (recency, frequency, tenure, trends)
- Apply mathematical transformations (log, polynomial, interaction terms)
- Extract features from dates, text, and structured data
- Use target encoding and other advanced techniques responsibly
The Five-Minute Feature That Beat Three Weeks of Neural Architecture Search
War Story --- In 2021, a mid-size SaaS company was losing subscribers at an alarming rate. The data science team spent three weeks experimenting with neural network architectures --- LSTMs over usage sequences, attention mechanisms over clickstream data, autoencoders for representation learning. Their best model achieved an AUC of 0.78 on the holdout set. Then a new hire, fresh from a customer success role, asked a simple question: "Do we know when each user last logged in?" The team added a single feature --- days_since_last_login --- to a gradient boosted tree. The AUC jumped to 0.84. One feature, five minutes of work, and it outperformed three weeks of architecture search.
The lesson is not that neural networks are bad. The lesson is that feature engineering is where domain knowledge meets data, and domain knowledge almost always wins. The customer success manager knew that login recency was the single strongest signal of disengagement because she had watched hundreds of customers ghost their way to cancellation. She did not need an LSTM to discover what she already knew. She needed someone to ask her.
That war story should recalibrate your intuition about where to invest your time. In a typical data science project, feature engineering accounts for 60-80% of the predictive lift. Not the algorithm. Not the hyperparameters. The features. Andrew Ng has said it plainly: "Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering."
This chapter is about that craft. It is about the systematic process of translating domain knowledge, business logic, and raw intuition into computable variables that give your model the information it needs to make good predictions. We will work primarily with the StreamFlow churn dataset you extracted in Chapter 5, and we will build features that a subscription business expert would recognize as meaningful --- because that is the point.
Feature engineering is not a mechanical process. It is thinking, with code.
Why Features Matter More Than Algorithms
Before we write a single line of feature engineering code, let us establish why this chapter is arguably the most important one in the book.
Consider a simple experiment. We take the raw StreamFlow subscriber data --- subscriber_id, signup_date, plan_type, last_login_date, total_hours_watched, support_ticket_count, num_devices --- and we train three models:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Load the raw StreamFlow subscriber data
df = pd.read_csv('streamflow_subscribers.csv', parse_dates=['signup_date', 'last_login_date'])
# Minimal features: just the raw numeric columns
raw_features = ['total_hours_watched', 'support_ticket_count', 'num_devices']
X_raw = df[raw_features].fillna(0)
y = df['churned_within_30_days']
X_train, X_test, y_train, y_test = train_test_split(
X_raw, y, test_size=0.2, random_state=42, stratify=y
)
# Three models, same raw features
models = {
'Logistic Regression': Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(random_state=42))
]),
'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=200, random_state=42)
}
print("=== Raw Features (3 features) ===")
for name, model in models.items():
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"{name:25s} AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
=== Raw Features (3 features) ===
Logistic Regression AUC: 0.641 (+/- 0.012)
Random Forest AUC: 0.687 (+/- 0.015)
Gradient Boosting AUC: 0.702 (+/- 0.014)
Now watch what happens when we engineer features:
# Engineered features from the same raw data
df['tenure_months'] = (
(pd.Timestamp('2025-01-31') - df['signup_date']).dt.days / 30.44
).round(1)
df['days_since_last_login'] = (
pd.Timestamp('2025-01-31') - df['last_login_date']
).dt.days
df['hours_per_tenure_month'] = (
df['total_hours_watched'] / df['tenure_months'].clip(lower=1)
)
df['tickets_per_tenure_month'] = (
df['support_ticket_count'] / df['tenure_months'].clip(lower=1)
)
df['is_multi_device'] = (df['num_devices'] > 1).astype(int)
engineered_features = [
'tenure_months', 'days_since_last_login', 'total_hours_watched',
'hours_per_tenure_month', 'support_ticket_count',
'tickets_per_tenure_month', 'num_devices', 'is_multi_device'
]
X_eng = df[engineered_features].fillna(0)
# Same random_state and stratify as the first split, so the row partition is
# identical and y_train / y_test from above still line up with these rows
X_train_eng, X_test_eng, _, _ = train_test_split(
    X_eng, y, test_size=0.2, random_state=42, stratify=y
)
print("\n=== Engineered Features (8 features) ===")
for name, model in models.items():
scores = cross_val_score(model, X_train_eng, y_train, cv=5, scoring='roc_auc')
print(f"{name:25s} AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
=== Engineered Features (8 features) ===
Logistic Regression AUC: 0.761 (+/- 0.009)
Random Forest AUC: 0.803 (+/- 0.011)
Gradient Boosting AUC: 0.819 (+/- 0.010)
The logistic regression with engineered features (0.761) outperforms the gradient boosting with raw features (0.702). The simple model with good features beats the complex model with bad features. This is the single most important lesson in applied machine learning, and it is the reason this chapter exists.
Production Tip --- When your model is not performing well, your first instinct should be to engineer better features, not to try a fancier algorithm. Algorithm shopping is the procrastination of data science. Feature engineering is the work.
The Feature Engineering Recipe
Feature engineering is not random exploration. It follows a systematic process:
1. Understand the domain. Talk to subject matter experts. Read the business wiki. Shadow a customer success manager. The best features come from people who understand the problem, not from people who understand gradient boosting.
2. Ask: "What would a human expert look at?" If you were a StreamFlow customer success manager trying to predict which subscribers would cancel, what would you check? Their recent usage? How long they have been a customer? Whether they have contacted support? Whether their usage is trending up or down? Every answer to this question is a candidate feature.
3. Translate intuition into computable variables. "Usage is declining" becomes avg_hours_last_30d - avg_hours_prior_30d. "They seem disengaged" becomes days_since_last_login. "They have had a bad experience" becomes support_tickets_last_90d.
4. Validate features against the target. Compute correlation, mutual information, or simple AUC of each feature individually. Features that have no univariate relationship with the target are unlikely to help (though interaction effects can be exceptions).
5. Respect the train/test boundary. Every feature computation must be reproducible at prediction time using only information available at that moment. We will return to this critical constraint at the end of the chapter.
Let us work through each category of feature engineering technique.
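Step 4 of the recipe can be sketched as a quick univariate screen: score each candidate feature alone against the target with AUC. The sketch below uses synthetic stand-in columns (the names informative and noise are illustrative, not StreamFlow fields):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Synthetic data: one feature that carries signal, one that is pure noise
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=5000)
X = pd.DataFrame({
    'informative': y * 2.0 + rng.normal(size=5000),  # correlates with target
    'noise': rng.normal(size=5000),                   # no relationship
})

for col in X.columns:
    auc = roc_auc_score(y, X[col])
    # An AUC near 0.5 means no univariate signal; far from 0.5 in either
    # direction means signal (below 0.5 is signal with reversed orientation)
    print(f"{col:12s} univariate AUC: {max(auc, 1 - auc):.3f}")
```

A feature can still earn its place through interactions even if its univariate AUC is near 0.5, so treat this screen as a prioritization tool, not a hard filter.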
Temporal Features: Recency, Frequency, and Tenure
Temporal features are the workhorses of subscription and behavioral models. They capture when things happened, how often, and for how long. If you remember nothing else from this chapter, remember RFT: Recency, Frequency, Tenure.
Recency Features
Recency measures how recently a user took an action. In subscription businesses, recency is often the single strongest predictor of churn. A user who logged in yesterday is dramatically less likely to cancel than one who has not logged in for three weeks.
import pandas as pd
import numpy as np
# Assume prediction_date is the date we make predictions
prediction_date = pd.Timestamp('2025-01-31')
def create_recency_features(df, prediction_date):
"""Create recency features from event timestamps."""
features = pd.DataFrame(index=df.index)
# Days since last login
features['days_since_last_login'] = (
prediction_date - df['last_login_date']
).dt.days
# Days since last support ticket
features['days_since_last_ticket'] = (
prediction_date - df['last_ticket_date']
).dt.days
# Days since last plan change (upgrade or downgrade)
features['days_since_plan_change'] = (
prediction_date - df['last_plan_change_date']
).dt.days
# Days since account creation (this is tenure, but it is also a recency metric)
features['account_age_days'] = (
prediction_date - df['signup_date']
).dt.days
# Binary: logged in within last 7 days?
features['active_last_7d'] = (
features['days_since_last_login'] <= 7
).astype(int)
# Binary: logged in within last 30 days?
features['active_last_30d'] = (
features['days_since_last_login'] <= 30
).astype(int)
return features
recency = create_recency_features(df, prediction_date)
print(recency.describe().round(1))
days_since_last_login days_since_last_ticket days_since_plan_change account_age_days active_last_7d active_last_30d
count 2400000.0 2400000.0 2400000.0 2400000.0 2400000.0 2400000.0
mean 12.4 87.3 241.6 438.2 0.7 0.9
std 18.7 112.4 198.3 312.5 0.5 0.3
min 0.0 0.0 0.0 1.0 0.0 0.0
25% 2.0 14.0 89.0 187.0 0.0 1.0
50% 5.0 48.0 184.0 364.0 1.0 1.0
75% 14.0 118.0 342.0 621.0 1.0 1.0
max 365.0 1095.0 1825.0 2190.0 1.0 1.0
Notice the distribution of days_since_last_login. The median is 5 days, but the 75th percentile is 14 days and the max is 365. That long tail contains your high-risk churners. The binary threshold features (active_last_7d, active_last_30d) give your model clean cutpoints to work with.
Frequency Features
Frequency measures how often a user takes an action within a defined window. The window matters. A user who watched 100 hours in their first month but zero in the last month is very different from a user who watches 10 hours every month.
def create_frequency_features(df, events_df, prediction_date):
"""
Create frequency features from an events table.
events_df has columns: subscriber_id, event_date, event_type,
hours_watched, genre
"""
features = pd.DataFrame(index=df.index)
# Filter events before prediction date (no future leakage)
events = events_df[events_df['event_date'] < prediction_date].copy()
# --- Usage frequency by window ---
for window_days, label in [(7, '7d'), (30, '30d'), (90, '90d')]:
cutoff = prediction_date - pd.Timedelta(days=window_days)
window_events = events[events['event_date'] >= cutoff]
usage = window_events.groupby('subscriber_id').agg(
sessions=('event_date', 'nunique'),
total_hours=('hours_watched', 'sum'),
unique_genres=('genre', 'nunique')
).rename(columns={
'sessions': f'sessions_last_{label}',
'total_hours': f'hours_last_{label}',
'unique_genres': f'genres_last_{label}'
})
features = features.join(usage, how='left')
# --- Support ticket frequency ---
for window_days, label in [(30, '30d'), (90, '90d')]:
cutoff = prediction_date - pd.Timedelta(days=window_days)
tickets = events[
(events['event_type'] == 'support_ticket') &
(events['event_date'] >= cutoff)
]
ticket_counts = tickets.groupby('subscriber_id').size()
features[f'support_tickets_last_{label}'] = ticket_counts
# Fill NaN with 0 (no events = zero frequency)
features = features.fillna(0)
return features
freq = create_frequency_features(df, events_df, prediction_date)
print(freq[['sessions_last_7d', 'sessions_last_30d', 'hours_last_30d',
'genres_last_30d', 'support_tickets_last_90d']].describe().round(1))
sessions_last_7d sessions_last_30d hours_last_30d genres_last_30d support_tickets_last_90d
count 2400000.0 2400000.0 2400000.0 2400000.0 2400000.0
mean 3.2 11.4 18.7 3.1 0.8
std 2.4 7.8 14.2 2.0 1.4
min 0.0 0.0 0.0 0.0 0.0
25% 1.0 5.0 8.2 2.0 0.0
50% 3.0 10.0 15.6 3.0 0.0
75% 5.0 16.0 26.3 4.0 1.0
max 7.0 30.0 120.4 8.0 23.0
Common Mistake --- Computing frequency features over the entire history of a user without windowing. A user who watched 500 hours over 3 years but zero in the last month is not an active user. Always window your frequency features, and always create multiple windows (7-day, 30-day, 90-day) so the model can learn which time horizon matters most.
Tenure Features
Tenure is how long a customer has been a customer. In subscription businesses, churn risk is highest in the first 90 days (the "danger zone") and lowest for subscribers who have survived their first year.
def create_tenure_features(df, prediction_date):
"""Create tenure features from signup date."""
features = pd.DataFrame(index=df.index)
# Raw tenure
features['tenure_days'] = (prediction_date - df['signup_date']).dt.days
features['tenure_months'] = (features['tenure_days'] / 30.44).round(1)
features['tenure_years'] = (features['tenure_days'] / 365.25).round(2)
# Tenure buckets (the "danger zone" pattern)
features['tenure_bucket'] = pd.cut(
    features['tenure_months'],
    bins=[0, 1, 3, 6, 12, 24, float('inf')],
    labels=['0-1mo', '1-3mo', '3-6mo', '6-12mo', '12-24mo', '24mo+'],
    right=True,
    include_lowest=True  # a brand-new subscriber (0.0 months) lands in '0-1mo'
)
# Binary: is the subscriber in the high-risk first 90 days?
features['is_first_90_days'] = (features['tenure_days'] <= 90).astype(int)
return features
tenure = create_tenure_features(df, prediction_date)
print(tenure['tenure_bucket'].value_counts().sort_index())
0-1mo 98400
1-3mo 192000
3-6mo 312000
6-12mo 504000
12-24mo 648000
24mo+ 645600
Name: tenure_bucket, dtype: int64
RFM Features for Subscription Data
RFM --- Recency, Frequency, Monetary --- is a framework from direct marketing that translates perfectly to subscription analytics. For StreamFlow, we adapt "Monetary" to "engagement value" since all subscribers in a plan tier pay the same price:
def create_rfm_features(df, events_df, prediction_date):
"""
Create RFM (Recency, Frequency, Monetary/Engagement) features.
For subscription businesses, 'Monetary' is adapted to engagement
metrics since revenue per user is tied to plan tier.
"""
features = pd.DataFrame(index=df.index)
# --- Recency (lower = more recent = better) ---
features['rfm_recency'] = (
prediction_date - df['last_login_date']
).dt.days
# --- Frequency (last 90 days) ---
cutoff_90d = prediction_date - pd.Timedelta(days=90)
recent_events = events_df[
    (events_df['event_date'] >= cutoff_90d) &
    (events_df['event_date'] < prediction_date)  # exclude future events
]
session_counts = recent_events.groupby('subscriber_id')['event_date'].nunique()
features['rfm_frequency'] = session_counts.reindex(df.index, fill_value=0)
# --- Monetary (avg hours per session, last 90 days) ---
engagement = recent_events.groupby('subscriber_id').agg(
total_hours=('hours_watched', 'sum'),
total_sessions=('event_date', 'nunique')
)
engagement['avg_hours_per_session'] = (
engagement['total_hours'] / engagement['total_sessions'].clip(lower=1)
)
features['rfm_monetary'] = engagement['avg_hours_per_session'].reindex(
df.index, fill_value=0
)
# --- RFM Score (quintile-based) ---
# Recency: lower is better, so reverse the scoring; rank first so the
# quantile edges stay unique despite heavy ties at low recency values
features['rfm_recency_score'] = pd.qcut(
    features['rfm_recency'].rank(method='first'), q=5, labels=[5, 4, 3, 2, 1]
).astype(int)
features['rfm_frequency_score'] = pd.qcut(
features['rfm_frequency'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5]
).astype(int)
features['rfm_monetary_score'] = pd.qcut(
features['rfm_monetary'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5]
).astype(int)
# Combined RFM score (simple sum)
features['rfm_total_score'] = (
features['rfm_recency_score'] +
features['rfm_frequency_score'] +
features['rfm_monetary_score']
)
return features
rfm = create_rfm_features(df, events_df, prediction_date)
print(rfm[['rfm_recency', 'rfm_frequency', 'rfm_monetary', 'rfm_total_score']].describe().round(2))
rfm_recency rfm_frequency rfm_monetary rfm_total_score
count 2400000.00 2400000.00 2400000.00 2400000.00
mean 12.40 24.80 1.64 9.00
std 18.70 18.20 0.89 3.46
min 0.00 0.00 0.00 3.00
25% 2.00 10.00 1.02 7.00
50% 5.00 22.00 1.58 9.00
75% 14.00 36.00 2.18 11.00
max 365.00 90.00 8.42 15.00
Production Tip --- In production systems, pre-compute RFM features nightly in SQL (Chapter 5) and store them in a feature store. Your model training pipeline reads from the store; your real-time prediction service reads from the same store. This guarantees that the features used in training are identical to the features used in serving.
Trend Features: Capturing Behavioral Change
Static features tell you what a user looks like right now. Trend features tell you whether they are getting better or worse. A subscriber who watched 20 hours last month is very different depending on whether they watched 25 hours the month before (declining) or 15 hours (increasing).
def create_trend_features(df, events_df, prediction_date):
"""
Create trend features by comparing recent behavior to prior behavior.
The key insight: the direction of change is often more predictive
than the absolute level.
"""
features = pd.DataFrame(index=df.index)
# Define two periods
recent_start = prediction_date - pd.Timedelta(days=30)
prior_start = prediction_date - pd.Timedelta(days=60)
recent_events = events_df[
(events_df['event_date'] >= recent_start) &
(events_df['event_date'] < prediction_date)
]
prior_events = events_df[
(events_df['event_date'] >= prior_start) &
(events_df['event_date'] < recent_start)
]
# Hours watched in each period
recent_hours = recent_events.groupby('subscriber_id')['hours_watched'].sum()
prior_hours = prior_events.groupby('subscriber_id')['hours_watched'].sum()
features['hours_last_30d'] = recent_hours.reindex(df.index, fill_value=0)
features['hours_prior_30d'] = prior_hours.reindex(df.index, fill_value=0)
# Absolute change
features['hours_change_30d'] = (
features['hours_last_30d'] - features['hours_prior_30d']
)
# Percentage change (with safe division)
features['hours_pct_change_30d'] = np.where(
features['hours_prior_30d'] > 0,
(features['hours_change_30d'] / features['hours_prior_30d']) * 100,
np.where(features['hours_last_30d'] > 0, 100.0, 0.0)
)
# Clip extreme percentage changes
features['hours_pct_change_30d'] = features['hours_pct_change_30d'].clip(-100, 500)
# Session frequency trend
recent_sessions = recent_events.groupby('subscriber_id')['event_date'].nunique()
prior_sessions = prior_events.groupby('subscriber_id')['event_date'].nunique()
features['sessions_last_30d'] = recent_sessions.reindex(df.index, fill_value=0)
features['sessions_prior_30d'] = prior_sessions.reindex(df.index, fill_value=0)
features['sessions_change_30d'] = (
features['sessions_last_30d'] - features['sessions_prior_30d']
)
# Binary trend indicator
features['usage_declining'] = (features['hours_change_30d'] < -2).astype(int)
features['usage_increasing'] = (features['hours_change_30d'] > 2).astype(int)
return features
trends = create_trend_features(df, events_df, prediction_date)
print(trends[['hours_change_30d', 'hours_pct_change_30d',
'sessions_change_30d', 'usage_declining']].describe().round(2))
hours_change_30d hours_pct_change_30d sessions_change_30d usage_declining
count 2400000.00 2400000.00 2400000.00 2400000.00
mean -1.24 -4.68 -0.42 0.31
std 9.87 42.34 4.18 0.46
min -78.40 -100.00 -22.00 0.00
25% -5.20 -28.57 -2.00 0.00
50% -0.40 -2.14 0.00 0.00
75% 3.10 18.75 1.00 1.00
max 92.60 500.00 24.00 1.00
Notice that the mean hours_change_30d is -1.24 --- on average, usage is slightly declining across the subscriber base. This makes sense for a service with 8.2% monthly churn. The usage_declining binary flag captures the 31% of subscribers whose usage dropped by more than 2 hours. This is a clean, interpretable signal.
Try It --- Create a three-period trend feature: compare the last 30 days to the prior 30 days and to the 30 days before that. A subscriber declining for two consecutive periods is a much stronger churn signal than one declining for a single period. You can encode this as a categorical: "accelerating decline", "recent decline", "stable", "recent growth", "accelerating growth".
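One way to approach that exercise is to compare the deltas between three consecutive 30-day windows and map them onto the suggested categories. This is a minimal sketch with made-up per-subscriber totals; the column names and the 2-hour threshold mirror the usage_declining flag above but are otherwise illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical hour totals for three consecutive 30-day windows
df = pd.DataFrame({
    'hours_last_30d':  [5.0, 10.0, 20.0, 30.0],
    'hours_prior_30d': [10.0, 15.0, 20.0, 20.0],
    'hours_prior_60d': [20.0, 12.0, 20.0, 10.0],
})

d1 = df['hours_last_30d'] - df['hours_prior_30d']   # most recent change
d2 = df['hours_prior_30d'] - df['hours_prior_60d']  # change one window earlier

# Conditions are evaluated in order; the first match wins
df['trend_3p'] = np.select(
    [
        (d1 < -2) & (d2 < -2),   # two consecutive declines
        (d1 < -2),               # declined only in the latest window
        (d1 > 2) & (d2 > 2),     # two consecutive increases
        (d1 > 2),                # grew only in the latest window
    ],
    ['accelerating decline', 'recent decline',
     'accelerating growth', 'recent growth'],
    default='stable'
)
print(df['trend_3p'].tolist())
# ['accelerating decline', 'recent decline', 'stable', 'accelerating growth']
```

The categorical can then be one-hot encoded, or the two deltas can be fed to the model directly and it will learn its own thresholds.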
Mathematical Transformations
Raw features often have distributions that make it hard for models to learn effectively. Mathematical transformations reshape these distributions to improve model performance.
Log Transformation
The log transformation compresses the right tail of skewed distributions. It is the most common transformation in data science, and for good reason: many real-world variables (revenue, page views, session duration) follow approximately log-normal distributions.
import matplotlib.pyplot as plt
# Total hours watched is right-skewed
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(df['total_hours_watched'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_title('Total Hours Watched (Raw)')
axes[0].set_xlabel('Hours')
# Log transform (add 1 to handle zeros)
df['log_total_hours'] = np.log1p(df['total_hours_watched'])
axes[1].hist(df['log_total_hours'], bins=50, edgecolor='black', alpha=0.7)
axes[1].set_title('Total Hours Watched (Log-Transformed)')
axes[1].set_xlabel('log(1 + Hours)')
plt.tight_layout()
plt.savefig('log_transform_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
When should you use a log transformation?
- Right-skewed distributions with a long tail (revenue, counts, durations)
- Multiplicative relationships that you want to make additive
- Features with orders-of-magnitude variation (a user with 10 hours and a user with 10,000 hours)
When should you not use it?
- Features with zeros or negative values (use np.log1p() for zeros; avoid log for negative values)
- Already-normal distributions (the transformation will skew them)
- Tree-based models (trees split on rank order, so monotonic transformations do not change the splits)
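The tree-based claim is easy to verify empirically: because np.log1p is monotonic, it preserves the rank order of every sample, so a decision tree partitions the training data identically either way. A small sketch on synthetic data (illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Skewed synthetic feature whose rank order carries the signal
rng = np.random.default_rng(0)
x = rng.lognormal(size=2000)
y = ((x + rng.normal(scale=2.0, size=2000)) > np.median(x)).astype(int)

X_raw = x.reshape(-1, 1)
X_log = np.log1p(X_raw)  # monotonic: same ordering of samples

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y)
tree_log = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_log, y)

# The learned thresholds differ, but the training samples fall into the
# same leaves, so the fitted predictions are identical
same = np.allclose(
    tree_raw.predict_proba(X_raw)[:, 1],
    tree_log.predict_proba(X_log)[:, 1],
)
print(same)  # True
```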
# Log transformation is most helpful for linear models
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
# Compare raw vs. log-transformed for logistic regression
features_raw = df[['total_hours_watched', 'support_ticket_count', 'tenure_days']].fillna(0)
features_log = features_raw.copy()
features_log['total_hours_watched'] = np.log1p(features_log['total_hours_watched'])
features_log['support_ticket_count'] = np.log1p(features_log['support_ticket_count'])
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
features_raw, y, test_size=0.2, random_state=42, stratify=y
)
X_train_log, X_test_log, _, _ = train_test_split(
features_log, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
lr = LogisticRegression(random_state=42)
# Raw
lr.fit(scaler.fit_transform(X_train_raw), y_train)
auc_raw = roc_auc_score(y_test, lr.predict_proba(scaler.transform(X_test_raw))[:, 1])
# Log-transformed
lr.fit(scaler.fit_transform(X_train_log), y_train)
auc_log = roc_auc_score(y_test, lr.predict_proba(scaler.transform(X_test_log))[:, 1])
print(f"Logistic Regression AUC (raw): {auc_raw:.4f}")
print(f"Logistic Regression AUC (log-transformed): {auc_log:.4f}")
Logistic Regression AUC (raw): 0.6812
Logistic Regression AUC (log-transformed): 0.7134
Box-Cox Transformation
The Box-Cox transformation is a generalization of the log transform. It finds the optimal power transformation to make a distribution as close to normal as possible:
from scipy import stats
# Box-Cox requires strictly positive values
hours_positive = df['total_hours_watched'][df['total_hours_watched'] > 0]
# Find optimal lambda
transformed, fitted_lambda = stats.boxcox(hours_positive)
print(f"Optimal lambda: {fitted_lambda:.4f}")
# lambda = 0 corresponds to log; lambda = 1 means no transformation
Optimal lambda: 0.1247
A lambda near 0 confirms that the log transform is close to optimal for this feature. In practice, np.log1p() is usually sufficient, and Box-Cox is reserved for cases where you need the extra precision.
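The "lambda = 0 corresponds to log" claim can be checked directly, since scipy.stats.boxcox accepts a fixed lmbda:

```python
import numpy as np
from scipy import stats

# At lambda = 0, the Box-Cox transform reduces to the natural log
x = np.array([1.0, 2.0, 10.0, 100.0])
print(np.allclose(stats.boxcox(x, lmbda=0), np.log(x)))  # True
```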
from sklearn.preprocessing import PowerTransformer
# PowerTransformer handles the Box-Cox transformation in a sklearn pipeline
# Yeo-Johnson variant handles zeros and negative values
pt = PowerTransformer(method='yeo-johnson', standardize=True)
features_to_transform = ['total_hours_watched', 'support_ticket_count', 'tenure_days']
X_transformed = pt.fit_transform(df[features_to_transform].fillna(0))
print(f"Lambdas: {dict(zip(features_to_transform, pt.lambdas_.round(4)))}")
Lambdas: {'total_hours_watched': 0.1384, 'support_ticket_count': 0.3012, 'tenure_days': 0.7821}
Common Mistake --- Fitting the PowerTransformer on the entire dataset including the test set. The optimal lambda must be computed only on the training data, then applied to the test data using transform(). Fitting on the full dataset leaks information from the test set into your feature engineering.
Polynomial Features
Polynomial features create non-linear terms from your existing features. They are most useful for linear models that cannot learn non-linear relationships on their own.
from sklearn.preprocessing import PolynomialFeatures
# Create quadratic features (degree=2)
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
base_features = df[['tenure_months', 'hours_last_30d', 'support_tickets_last_90d']].fillna(0)
poly_features = poly.fit_transform(base_features)
print(f"Input features: {base_features.shape[1]}")
print(f"Output features: {poly_features.shape[1]}")
print(f"Feature names: {poly.get_feature_names_out()}")
Input features: 3
Output features: 9
Feature names: ['tenure_months' 'hours_last_30d' 'support_tickets_last_90d'
'tenure_months^2' 'tenure_months hours_last_30d'
'tenure_months support_tickets_last_90d' 'hours_last_30d^2'
'hours_last_30d support_tickets_last_90d' 'support_tickets_last_90d^2']
Three features become nine. With degree=3, they would become 19. With 10 input features and degree=2, you get 65 outputs. Polynomial features scale combinatorially, which means you need to be selective.
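The counts above come from a standard combinatorial identity: the number of monomials in n variables with total degree at most d is C(n + d, d), minus one if the constant (bias) term is excluded. A quick helper to sanity-check feature explosion before you generate it (the function name is mine, not sklearn's):

```python
from math import comb

def n_poly_features(n_inputs: int, degree: int, include_bias: bool = False) -> int:
    """Count of PolynomialFeatures outputs: all monomials of total degree <= degree."""
    total = comb(n_inputs + degree, degree)  # includes the constant term
    return total if include_bias else total - 1

print(n_poly_features(3, 2))   # 9
print(n_poly_features(3, 3))   # 19
print(n_poly_features(10, 2))  # 65
```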
Production Tip --- Use interaction_only=True if you want interaction terms (tenure * hours) without squared terms (tenure^2). This is often the better choice because interaction terms capture "these two things together" while squared terms capture "more of this thing has a non-linear effect."
Interaction Features
Interaction features capture the combined effect of two variables. They encode the insight that the effect of one variable depends on the value of another.
def create_interaction_features(df):
"""
Create interaction features that encode domain knowledge.
Each interaction should have a business interpretation.
Do not blindly generate all pairwise interactions.
"""
features = pd.DataFrame(index=df.index)
# Tenure * usage trend: new users with declining usage are highest risk
features['tenure_x_usage_decline'] = (
df['tenure_months'] * df['hours_change_30d']
)
# Support tickets per hour watched: high ratio = frustrated user
features['tickets_per_hour'] = np.where(
df['hours_last_30d'] > 0,
df['support_tickets_last_90d'] / df['hours_last_30d'],
df['support_tickets_last_90d'] # if no hours, just use ticket count
)
# Device count * genre diversity: multi-device, multi-genre = engaged household
features['device_x_genre_diversity'] = (
df['num_devices'] * df['genres_last_30d']
)
# Is the user a new subscriber (< 3 months) with zero recent usage?
features['new_and_inactive'] = (
(df['tenure_months'] < 3) & (df['hours_last_30d'] == 0)
).astype(int)
return features
interactions = create_interaction_features(df)
print(interactions.describe().round(3))
tenure_x_usage_decline tickets_per_hour device_x_genre_diversity new_and_inactive
count 2400000.000 2400000.000 2400000.000 2400000.000
mean -18.446 0.062 5.834 0.024
std 189.723 0.187 5.192 0.153
min -4896.000 0.000 0.000 0.000
25% -48.360 0.000 2.000 0.000
50% -3.640 0.000 4.000 0.000
75% 26.460 0.047 8.000 0.000
max 7200.000 12.000 48.000 1.000
The key rule for interaction features: every interaction should have a business interpretation. If you cannot explain why two features interact to someone who understands the business, the interaction is probably noise. "Support tickets per hour watched" makes sense --- it captures frustration intensity relative to engagement. "Tenure times device count" is less clear --- what does it mean for a 12-month subscriber to have 3 devices? That interaction should earn its place through empirical validation, not just combinatorial generation.
Date Feature Extraction
Dates contain far more information than a single timestamp. A subscription that started on January 2nd behaves differently from one that started on December 26th, and a user who signed up on a Monday is statistically different from one who signed up on a Saturday.
def create_date_features(df, date_col='signup_date'):
"""Extract features from a datetime column."""
features = pd.DataFrame(index=df.index)
dt = df[date_col]
# Calendar features
features[f'{date_col}_month'] = dt.dt.month
features[f'{date_col}_day_of_week'] = dt.dt.dayofweek # 0=Mon, 6=Sun
features[f'{date_col}_day_of_month'] = dt.dt.day
features[f'{date_col}_quarter'] = dt.dt.quarter
features[f'{date_col}_is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
# Cyclical encoding for month (so December and January are close)
features[f'{date_col}_month_sin'] = np.sin(2 * np.pi * dt.dt.month / 12)
features[f'{date_col}_month_cos'] = np.cos(2 * np.pi * dt.dt.month / 12)
# Cyclical encoding for day of week
features[f'{date_col}_dow_sin'] = np.sin(2 * np.pi * dt.dt.dayofweek / 7)
features[f'{date_col}_dow_cos'] = np.cos(2 * np.pi * dt.dt.dayofweek / 7)
# Business-specific: did they sign up during a promotion period?
promo_periods = [
(pd.Timestamp('2024-11-25'), pd.Timestamp('2024-12-02')), # Black Friday
(pd.Timestamp('2024-01-01'), pd.Timestamp('2024-01-07')), # New Year
]
features[f'{date_col}_during_promo'] = 0
for start, end in promo_periods:
features.loc[
(dt >= start) & (dt <= end), f'{date_col}_during_promo'
] = 1
return features
date_features = create_date_features(df, 'signup_date')
print(date_features.head(3).T)
0 1 2
signup_date_month 3.0 7.0 11.0
signup_date_day_of_week 2.0 4.0 1.0
signup_date_day_of_month 15.0 22.0 8.0
signup_date_quarter 1.0 3.0 4.0
signup_date_is_weekend 0.0 0.0 0.0
signup_date_month_sin 1.0 -0.5 -0.5
signup_date_month_cos 0.0 -0.9 0.9
signup_date_dow_sin 1.0 -0.4 0.8
signup_date_dow_cos -0.2 -0.9 0.6
signup_date_during_promo 0.0 0.0 0.0
Common Mistake --- Using month as a raw integer (1-12). This tells the model that December (12) is 12 times January (1) and that December is far from January, when cyclically they are adjacent. Cyclical encoding (sine/cosine) preserves the circular relationship. This matters most for linear models. Tree-based models can handle raw integers because they split on thresholds, not on magnitude.
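You can see the adjacency property directly by measuring distances in the (sin, cos) plane: December and January sit next to each other on the circle, while December and June are diametrically opposed. A small illustration:

```python
import numpy as np

def month_to_cyclic(month: int) -> np.ndarray:
    """Map month 1-12 onto the unit circle so adjacent months stay close."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

dec = month_to_cyclic(12)
jan = month_to_cyclic(1)
jun = month_to_cyclic(6)

# Euclidean distance in (sin, cos) space
print(round(np.linalg.norm(dec - jan), 2))  # 0.52: December and January are neighbors
print(round(np.linalg.norm(dec - jun), 2))  # 2.0: December and June are opposite
```

Under raw integer encoding the December-January distance would be 11, the largest in the feature, which is exactly backwards for a linear model.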
Ratio and Normalized Features
Raw counts are often less informative than ratios. "This user filed 5 support tickets" is ambiguous. Is that a lot? It depends on how long they have been a customer and how much they use the product.
def create_ratio_features(df):
"""
Create ratio features that normalize raw counts by exposure.
Ratios control for the 'opportunity to have the event' and
make comparisons fair across users with different tenures
and usage levels.
"""
features = pd.DataFrame(index=df.index)
# Usage intensity: hours per month of tenure
features['hours_per_tenure_month'] = np.where(
df['tenure_months'] > 0,
df['total_hours_watched'] / df['tenure_months'],
0
)
# Support burden: tickets per month of tenure
features['tickets_per_tenure_month'] = np.where(
df['tenure_months'] > 0,
df['support_ticket_count'] / df['tenure_months'],
0
)
# Engagement depth: hours per session (recent)
features['hours_per_session_30d'] = np.where(
df['sessions_last_30d'] > 0,
df['hours_last_30d'] / df['sessions_last_30d'],
0
)
# Genre exploration rate: unique genres / total sessions
features['genre_diversity_score'] = np.where(
df['sessions_last_90d'] > 0,
df['genres_last_90d'] / df['sessions_last_90d'],
0
)
# Device utilization: devices used last 30 days / total registered devices
features['device_utilization'] = np.where(
df['num_devices'] > 0,
df['active_devices_last_30d'] / df['num_devices'],
0
)
# Weekend usage ratio: what fraction of usage happens on weekends?
features['weekend_usage_ratio'] = np.where(
df['hours_last_30d'] > 0,
df['weekend_hours_last_30d'] / df['hours_last_30d'],
0
)
return features
ratios = create_ratio_features(df)
print(ratios.describe().round(3))
hours_per_tenure_month tickets_per_tenure_month hours_per_session_30d genre_diversity_score device_utilization weekend_usage_ratio
count 2400000.000 2400000.000 2400000.000 2400000.000 2400000.000 2400000.000
mean 1.842 0.067 1.621 0.134 0.724 0.312
std 1.406 0.098 0.894 0.087 0.289 0.148
min 0.000 0.000 0.000 0.000 0.000 0.000
25% 0.912 0.000 1.024 0.071 0.500 0.214
50% 1.583 0.034 1.487 0.125 0.750 0.298
75% 2.404 0.091 2.048 0.182 1.000 0.394
max 24.680 2.333 12.040 1.000 1.000 1.000
A note on genre_diversity_score: this feature captures how broadly a subscriber explores the content library. A subscriber who watches only one genre is more vulnerable to a single bad content decision. A subscriber who explores widely has more reasons to stay. This kind of feature --- which requires domain intuition to create --- is exactly what separates data science from data engineering.
Text Feature Extraction (Preview)
StreamFlow subscribers write support tickets and leave plan-cancellation reasons in free text. Even simple text features can be powerful:
def create_text_features(df, text_col='cancellation_reason'):
"""
Extract simple features from text fields.
Full NLP feature engineering is covered in Chapter 26.
Here we extract basic signals.
"""
features = pd.DataFrame(index=df.index)
text = df[text_col].fillna('')
# Basic: does the field have any text?
features[f'{text_col}_has_text'] = (text.str.len() > 0).astype(int)
# Length features
features[f'{text_col}_char_count'] = text.str.len()
features[f'{text_col}_word_count'] = text.str.split().str.len().fillna(0).astype(int)
# Keyword flags: keep price complaints separate from general negativity
price_keywords = ['expensive', 'cost', 'price']
negative_keywords = ['waste', 'terrible', 'awful', 'hate']
competitor_keywords = ['netflix', 'spotify', 'youtube', 'switched', 'alternative']
features[f'{text_col}_mentions_price'] = (
text.str.lower().str.contains('|'.join(price_keywords), na=False)
).astype(int)
features[f'{text_col}_mentions_negative'] = (
text.str.lower().str.contains('|'.join(negative_keywords), na=False)
).astype(int)
features[f'{text_col}_mentions_competitor'] = (
text.str.lower().str.contains('|'.join(competitor_keywords), na=False)
).astype(int)
return features
# Only applicable to subscribers who have initiated cancellation
# or left feedback --- not available for current active subscribers
# Use with caution: this feature may only be available AFTER churn intent is visible
Common Mistake --- Using cancellation reason text as a feature to predict churn. The cancellation reason is only written when the user is already canceling --- it is target leakage by definition. The text features above would only be useful for understanding why users churn (descriptive), not for predicting who will churn (predictive). Be ruthless about the timeline: would this data be available at the moment you need to make the prediction?
Binning: Converting Continuous Features to Categories
Binning (also called discretization) converts continuous features into categorical buckets. It can help when the relationship between a feature and the target is non-linear and step-like rather than smooth.
def create_binned_features(df):
"""
Bin continuous features into categories.
Use equal-width bins when absolute magnitudes matter.
Use quantile bins when you want equal-sized groups.
Use domain-informed bins when you know the breakpoints.
"""
features = pd.DataFrame(index=df.index)
# Domain-informed bins: tenure buckets based on known churn patterns
features['tenure_risk_group'] = pd.cut(
df['tenure_months'],
bins=[0, 1, 3, 6, 12, 24, float('inf')],
labels=['very_high', 'high', 'elevated', 'moderate', 'low', 'very_low'],
include_lowest=True  # keep tenure == 0 in the first bucket instead of NaN
)
# Quantile bins: usage quartiles (equal-frequency)
# Note: if hours_last_30d is zero-heavy, quantile edges can collide;
# pd.qcut would then need duplicates='drop' and fewer labels
features['usage_quartile'] = pd.qcut(
df['hours_last_30d'], q=4, labels=['Q1_low', 'Q2', 'Q3', 'Q4_high']
)
# Equal-width bins: support ticket count
features['ticket_level'] = pd.cut(
df['support_tickets_last_90d'],
bins=[-1, 0, 1, 3, float('inf')],
labels=['none', 'low', 'moderate', 'high']
)
return features
When to bin:
- When the relationship is clearly non-linear with plateau effects (churn risk drops sharply after 3 months of tenure, then levels off)
- When you need interpretable segments for business stakeholders ("high-risk tenure group" is more actionable than "tenure coefficient = -0.028")
- When working with linear models that cannot capture non-linearity otherwise
When not to bin:
- With tree-based models (they find optimal split points automatically)
- When the relationship is smoothly linear or monotonic
- When you are discarding useful variation within each bin
Production Tip --- If you bin features, store the bin edges as configuration, not as code. When your data distribution shifts (and it will), you need to decide whether to keep the original bins or re-compute them. In most production systems, you keep the original bins and monitor for distributional drift.
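One way to follow this tip, sketched below with hypothetical config names: the bin edges live in a versioned configuration object (loaded from JSON/YAML in practice), and the transform only applies whatever the config says.

```python
import pandas as pd

# Bin edges as configuration, not code. In production this dict would be
# loaded from a versioned config file; names here are illustrative.
BIN_CONFIG = {
    'tenure_months': {
        'edges': [0, 1, 3, 6, 12, 24, float('inf')],
        'labels': ['very_high', 'high', 'elevated', 'moderate', 'low', 'very_low'],
    },
}

def apply_configured_bins(df, config):
    """Apply every configured binning spec to its column."""
    out = pd.DataFrame(index=df.index)
    for col, spec in config.items():
        out[f'{col}_bin'] = pd.cut(
            df[col], bins=spec['edges'], labels=spec['labels'],
            include_lowest=True,  # keep the exact value 0 in the first bucket
        )
    return out

binned = apply_configured_bins(
    pd.DataFrame({'tenure_months': [0.5, 4.0, 30.0]}), BIN_CONFIG
)
print(binned['tenure_months_bin'].tolist())  # ['very_high', 'elevated', 'very_low']
```

Because the edges are data, you can diff them between model versions and decide explicitly when (if ever) to recompute them.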
Target Encoding (Advanced)
Target encoding replaces a categorical value with the mean of the target variable for that category. It is powerful for high-cardinality categoricals but dangerous if done incorrectly.
# Naive target encoding (DO NOT USE IN PRODUCTION)
plan_means_WRONG = df.groupby('plan_type')['churned_within_30_days'].mean()
df['plan_target_enc_WRONG'] = df['plan_type'].map(plan_means_WRONG)
# This leaks the target into the features!
The problem: if you compute the target mean on the entire dataset, the encoded feature contains information from the test set. Worse, for each individual row, the target mean includes that row's own target value. This is leakage.
The correct approach uses cross-validation within the training set:
from sklearn.model_selection import KFold, train_test_split
def target_encode_cv(train_df, test_df, cat_col, target_col, n_splits=5,
smoothing=10, random_state=42):
"""
Target encoding with cross-validation to prevent leakage.
For each fold of the training data:
- Compute target mean on the out-of-fold data
- Apply to the in-fold data
For test data:
- Use the global target mean from the full training set
Smoothing blends the category mean with the global mean,
regularizing categories with few observations.
"""
global_mean = train_df[target_col].mean()
# Initialize encoded columns
train_encoded = pd.Series(index=train_df.index, dtype=float)
kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
for train_idx, val_idx in kf.split(train_df):
# Compute target stats on training fold
fold_train = train_df.iloc[train_idx]
stats = fold_train.groupby(cat_col)[target_col].agg(['mean', 'count'])
# Apply smoothing: blend category mean with global mean
# More observations -> trust the category mean more
smoothed_mean = (
(stats['count'] * stats['mean'] + smoothing * global_mean) /
(stats['count'] + smoothing)
)
# Apply to validation fold
val_fold = train_df.iloc[val_idx]
train_encoded.iloc[val_idx] = val_fold[cat_col].map(smoothed_mean)
# Fill any unmapped categories with global mean
train_encoded = train_encoded.fillna(global_mean)
# For test data: use full training set statistics
stats = train_df.groupby(cat_col)[target_col].agg(['mean', 'count'])
smoothed_mean = (
(stats['count'] * stats['mean'] + smoothing * global_mean) /
(stats['count'] + smoothing)
)
test_encoded = test_df[cat_col].map(smoothed_mean).fillna(global_mean)
return train_encoded, test_encoded
# Usage
X_train_split, X_test_split, y_train_split, y_test_split = train_test_split(
df, y, test_size=0.2, random_state=42, stratify=y
)
train_enc, test_enc = target_encode_cv(
X_train_split, X_test_split,
cat_col='primary_genre',
target_col='churned_within_30_days',
n_splits=5,
smoothing=10,
random_state=42
)
print("Target-encoded genre (training set sample):")
print(train_enc.head(10).round(4))
Target-encoded genre (training set sample):
0 0.0724
1 0.0891
2 0.0634
3 0.1042
4 0.0724
5 0.0634
6 0.0891
7 0.0724
8 0.1042
9 0.0788
dtype: float64
The smoothing parameter is crucial. Without smoothing, a genre with only 3 subscribers and a 33% churn rate gets an encoded value of 0.33 --- which is noise, not signal. Smoothing pulls this toward the global mean (0.082), giving (3 * 0.33 + 10 * 0.082) / (3 + 10) = 0.139. With more data, the category-specific mean dominates. With little data, the global mean dominates. This is Bayesian shrinkage in disguise.
Common Mistake --- Using target encoding without smoothing on categories with few observations. A category with 2 observations and 100% churn rate is not truly 100% churn --- it is 2 random observations. Smoothing prevents your model from overfitting to small-sample noise.
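The shrinkage arithmetic from the paragraph above can be checked in a few lines. This is just the smoothing formula from `target_encode_cv` factored out on its own:

```python
def smoothed_mean(cat_mean, cat_count, global_mean, smoothing=10):
    """Blend a category's target mean with the global mean.

    With few observations the global mean dominates; with many,
    the category mean dominates.
    """
    return (cat_count * cat_mean + smoothing * global_mean) / (cat_count + smoothing)

# The worked example from the text: 3 subscribers, 33% churn, global 8.2%
print(round(smoothed_mean(0.33, 3, 0.082), 3))      # 0.139
# The same 33% churn rate backed by 3,000 subscribers is barely shrunk
print(round(smoothed_mean(0.33, 3000, 0.082), 3))   # 0.329
```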
Putting It All Together: The StreamFlow Feature Matrix
Let us assemble all of our engineered features into a single feature matrix for the progressive project:
def engineer_streamflow_features(df, events_df, prediction_date):
"""
Master feature engineering function for the StreamFlow churn model.
Takes raw subscriber data and events, returns a feature matrix.
This is Milestone M2 of the progressive project.
"""
features = pd.DataFrame(index=df.index)
# === Tenure features ===
features['tenure_months'] = (
(prediction_date - df['signup_date']).dt.days / 30.44
).round(1)
features['is_first_90_days'] = (features['tenure_months'] < 3).astype(int)
# === Recency features ===
features['days_since_last_login'] = (
prediction_date - df['last_login_date']
).dt.days
features['active_last_7d'] = (features['days_since_last_login'] <= 7).astype(int)
# === Frequency features (multiple windows) ===
for window_days, label in [(7, '7d'), (30, '30d'), (90, '90d')]:
cutoff = prediction_date - pd.Timedelta(days=window_days)
window_events = events_df[
(events_df['event_date'] >= cutoff) &
(events_df['event_date'] < prediction_date)
]
usage = window_events.groupby('subscriber_id').agg(
sessions=('event_date', 'nunique'),
total_hours=('hours_watched', 'sum'),
unique_genres=('genre', 'nunique')
)
features[f'sessions_last_{label}'] = usage['sessions'].reindex(
df.index, fill_value=0
)
features[f'hours_last_{label}'] = usage['total_hours'].reindex(
df.index, fill_value=0
)
features[f'genres_last_{label}'] = usage['unique_genres'].reindex(
df.index, fill_value=0
)
# === Support tickets ===
cutoff_90d = prediction_date - pd.Timedelta(days=90)
ticket_events = events_df[
(events_df['event_type'] == 'support_ticket') &
(events_df['event_date'] >= cutoff_90d) &
(events_df['event_date'] < prediction_date)
]
features['support_tickets_last_90d'] = (
ticket_events.groupby('subscriber_id').size()
.reindex(df.index, fill_value=0)
)
# === Trend features ===
recent_start = prediction_date - pd.Timedelta(days=30)
prior_start = prediction_date - pd.Timedelta(days=60)
recent = events_df[
(events_df['event_date'] >= recent_start) &
(events_df['event_date'] < prediction_date)
].groupby('subscriber_id')['hours_watched'].sum()
prior = events_df[
(events_df['event_date'] >= prior_start) &
(events_df['event_date'] < recent_start)
].groupby('subscriber_id')['hours_watched'].sum()
features['hours_last_30d_trend'] = (
recent.reindex(df.index, fill_value=0) -
prior.reindex(df.index, fill_value=0)
)
features['usage_declining'] = (features['hours_last_30d_trend'] < -2).astype(int)
# === Ratio features ===
# Exposure inside the 90-day window is min(tenure, 3) months
features['hours_per_tenure_month'] = np.where(
features['tenure_months'] > 0,
features['hours_last_90d'] / features['tenure_months'].clip(upper=3),
0
)
features['tickets_per_hour'] = np.where(
features['hours_last_90d'] > 0,
features['support_tickets_last_90d'] / features['hours_last_90d'],
features['support_tickets_last_90d']
)
# === Interaction features ===
features['device_x_genre_diversity'] = (
df['num_devices'] * features['genres_last_30d']
)
features['new_and_inactive'] = (
(features['tenure_months'] < 3) & (features['hours_last_30d'] == 0)
).astype(int)
# === Plan change features ===
features['plan_changes'] = df['plan_change_count']
features['has_downgraded'] = df['has_downgraded'].astype(int)
# === Device features ===
features['device_count'] = df['num_devices']
features['is_multi_device'] = (df['num_devices'] > 1).astype(int)
# === Genre diversity ===
features['genre_diversity_score'] = np.where(
features['sessions_last_90d'] > 0,
features['genres_last_90d'] / features['sessions_last_90d'],
0
)
# === Log transforms (for linear models) ===
features['log_hours_last_30d'] = np.log1p(features['hours_last_30d'])
features['log_tenure_months'] = np.log1p(features['tenure_months'])
print(f"Engineered {features.shape[1]} features for {features.shape[0]} subscribers")
return features
# Build the feature matrix
feature_matrix = engineer_streamflow_features(df, events_df, prediction_date)
Engineered 27 features for 2400000 subscribers
# Quick validation: check feature distributions
print(feature_matrix.describe().round(3).T[['mean', 'std', 'min', 'max']])
mean std min max
tenure_months 14.387 10.260 0.000 71.900
is_first_90_days 0.121 0.326 0.000 1.000
days_since_last_login 12.400 18.700 0.000 365.000
active_last_7d 0.687 0.464 0.000 1.000
sessions_last_7d 3.200 2.400 0.000 7.000
hours_last_7d 4.340 3.820 0.000 28.600
genres_last_7d 1.800 1.300 0.000 6.000
sessions_last_30d 11.400 7.800 0.000 30.000
hours_last_30d 18.700 14.200 0.000 120.400
genres_last_30d 3.100 2.000 0.000 8.000
sessions_last_90d 28.600 19.400 0.000 90.000
hours_last_90d 52.800 41.300 0.000 362.000
genres_last_90d 4.200 2.400 0.000 8.000
support_tickets_last_90d 0.800 1.400 0.000 23.000
hours_last_30d_trend -1.240 9.870 -78.400 92.600
usage_declining 0.310 0.463 0.000 1.000
hours_per_tenure_month 5.872 8.124 0.000 120.400
tickets_per_hour 0.062 0.187 0.000 12.000
device_x_genre_diversity 5.834 5.192 0.000 48.000
new_and_inactive 0.024 0.153 0.000 1.000
plan_changes 0.680 1.020 0.000 8.000
has_downgraded 0.124 0.330 0.000 1.000
device_count 1.840 0.920 1.000 6.000
is_multi_device 0.621 0.485 0.000 1.000
genre_diversity_score 0.134 0.087 0.000 1.000
log_hours_last_30d 2.528 1.142 0.000 4.797
log_tenure_months 2.364 0.844 0.000 4.290
That is 27 features engineered from raw subscription data, event logs, and domain knowledge. In the exercises, you will push this to 15+ unique feature concepts (some of these 27 are variants of the same concept).
Feature Engineering Without Data Leakage
This is the most important section in the chapter. Everything we have built above is useless if we violate the train/test boundary.
The Golden Rule
Every feature must be computable using only information available at the time of prediction.
This rule has two implications:
- No future information. If you are predicting churn as of January 31, 2025, you cannot use any data from February 2025. This is obvious, but it is easy to violate when your event table includes all dates and you forget to filter.
- No target information. Statistics used to build features must be computed on the training set only, never on data that includes the test set. If you compute the mean of a feature for normalization, you compute it on the training set only. If you use target encoding, you compute the target means on the training set only.
The Leakage Checklist
Before training any model, walk through this checklist for every feature:
def leakage_audit(feature_name, description):
"""
Document each feature's relationship to the prediction timeline.
For each feature, answer:
1. When is this information generated? (Before/at/after prediction time?)
2. Does this feature use target information? (Direct or indirect?)
3. Could this feature be computed in real-time at prediction time?
"""
print(f"Feature: {feature_name}")
print(f"Description: {description}")
print("---")
# Examples
leakage_audit(
'days_since_last_login',
'Available: Yes. Computed from last_login_date, which is known at prediction time.'
)
leakage_audit(
'hours_last_30d',
'Available: Yes. Computed from events in the 30 days before prediction date.'
)
leakage_audit(
'cancellation_reason_length',
'Available: NO. The cancellation reason is written WHEN the user cancels. '
'Using it to predict cancellation is using the target to predict the target.'
)
leakage_audit(
'avg_hours_watched_all_users',
'Potential leakage: If computed on the full dataset including test set, '
'this leaks test-set information. Compute only on training data.'
)
Proper Feature Engineering Pipeline
The correct approach is to build your feature engineering inside a pipeline that respects the train/test split:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin
class TemporalFeatureEngineer(BaseEstimator, TransformerMixin):
"""
Custom transformer that creates temporal features.
Wrapping feature engineering in a sklearn transformer ensures
it runs inside cross-validation and respects the train/test boundary.
"""
def __init__(self, prediction_date=None):
self.prediction_date = prediction_date
def fit(self, X, y=None):
# No fitting needed for deterministic temporal features
return self
def transform(self, X):
X = X.copy()
pred_date = self.prediction_date or pd.Timestamp.now()
if 'signup_date' in X.columns:
X['tenure_months'] = (
(pred_date - X['signup_date']).dt.days / 30.44
)
if 'last_login_date' in X.columns:
X['days_since_last_login'] = (
pred_date - X['last_login_date']
).dt.days
return X
class StatisticalFeatureEngineer(BaseEstimator, TransformerMixin):
"""
Custom transformer for features that require fitting on training data.
Example: target encoding, mean/std normalization of group features.
"""
def __init__(self, smoothing=10):
self.smoothing = smoothing
self.encoding_maps_ = {}
self.global_mean_ = None
def fit(self, X, y):
self.global_mean_ = y.mean()
# Fit target encoding for categorical columns
for col in X.select_dtypes(include='object').columns:
stats = pd.DataFrame({'target': y, 'cat': X[col]}).groupby('cat')['target'].agg(['mean', 'count'])
self.encoding_maps_[col] = (
(stats['count'] * stats['mean'] + self.smoothing * self.global_mean_) /
(stats['count'] + self.smoothing)
)
return self
def transform(self, X):
X = X.copy()
for col, mapping in self.encoding_maps_.items():
X[f'{col}_target_enc'] = X[col].map(mapping).fillna(self.global_mean_)
return X
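To show how such a transformer behaves inside cross-validation, here is a self-contained sketch. `TargetEncoder` is a compact stand-in that mirrors the fit/transform logic of `StatisticalFeatureEngineer` above, and the data frame is synthetic, purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

class TargetEncoder(BaseEstimator, TransformerMixin):
    """Compact smoothed target encoder (same idea as StatisticalFeatureEngineer)."""
    def __init__(self, smoothing=10):
        self.smoothing = smoothing

    def fit(self, X, y):
        self.global_mean_ = y.mean()
        self.maps_ = {}
        for col in X.select_dtypes(include='object').columns:
            stats = y.groupby(X[col]).agg(['mean', 'count'])
            self.maps_[col] = (
                (stats['count'] * stats['mean'] + self.smoothing * self.global_mean_)
                / (stats['count'] + self.smoothing)
            )
        return self

    def transform(self, X):
        X = X.copy()
        for col, mapping in self.maps_.items():
            X[col] = X[col].map(mapping).fillna(self.global_mean_)
        return X

# Tiny synthetic frame standing in for the StreamFlow data
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'hours_last_30d': rng.exponential(15, 400),
    'plan_type': rng.choice(['basic', 'standard', 'premium'], 400),
})
y = pd.Series((rng.random(400) < 0.1).astype(int))

pipe = Pipeline([('enc', TargetEncoder()), ('clf', LogisticRegression(max_iter=1000))])
# Because encoding happens inside the pipeline, each CV fold re-fits the
# encoder on its own training rows only -- the fold's validation rows
# never contribute to the encoding it is scored with.
scores = cross_val_score(pipe, X, y, cv=3, scoring='roc_auc')
print(len(scores))  # 3
```

The key design point: the encoder's statistics are fitted attributes (`global_mean_`, `maps_`), so `cross_val_score` recomputes them per fold automatically.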
Production Tip --- In production ML systems, feature engineering functions are versioned alongside model code. When you retrain a model, you must use the same feature engineering code that was used during the previous training. A mismatch between training-time and serving-time feature computation is the #1 cause of silent model degradation. Feature stores (Chapter 10) exist specifically to solve this problem.
Feature Validation: Sanity Checks Before Modeling
Before you hand your feature matrix to any model, run these sanity checks:
def validate_feature_matrix(features, target):
"""
Run sanity checks on the feature matrix before modeling.
Catches common problems: leakage, constant features,
extreme cardinality, and suspiciously high correlations.
"""
print(f"Shape: {features.shape}")
print(f"Target rate: {target.mean():.4f}")
print()
# Check for constant or near-constant features
nunique = features.nunique()
constant = nunique[nunique <= 1]
if len(constant) > 0:
print(f"WARNING: Constant features (remove these): {list(constant.index)}")
# Check for features with very high missing rates
missing_rate = features.isnull().mean()
high_missing = missing_rate[missing_rate > 0.5]
if len(high_missing) > 0:
print(f"WARNING: >50% missing: {dict(high_missing.round(3))}")
# Check for suspiciously high AUC (possible leakage)
from sklearn.metrics import roc_auc_score
print("\nPer-feature AUC (check for leakage if any > 0.90):")
for col in features.select_dtypes(include=[np.number]).columns:
valid_mask = features[col].notna()
if valid_mask.sum() > 100:
auc = roc_auc_score(
target[valid_mask],
features[col][valid_mask]
)
# Flag features with suspiciously high predictive power
flag = " <<< INVESTIGATE" if auc > 0.90 or auc < 0.10 else ""
if auc > 0.70 or auc < 0.30 or flag:
print(f" {col:35s} AUC: {auc:.4f}{flag}")
# Check for high inter-feature correlation
numeric_features = features.select_dtypes(include=[np.number])
corr_matrix = numeric_features.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr = [(col, row, corr_matrix.loc[row, col])
for col in upper.columns for row in upper.index
if upper.loc[row, col] > 0.95]
if high_corr:
print(f"\nWARNING: Highly correlated pairs (>0.95):")
for col1, col2, corr in high_corr:
print(f" {col1} <-> {col2}: {corr:.3f}")
validate_feature_matrix(feature_matrix, y)
Shape: (2400000, 27)
Target rate: 0.0820
Per-feature AUC (check for leakage if any > 0.90):
days_since_last_login AUC: 0.7842
active_last_7d AUC: 0.7318
hours_last_30d AUC: 0.7124
sessions_last_30d AUC: 0.7056
usage_declining AUC: 0.7012
hours_last_30d_trend AUC: 0.2984
WARNING: Highly correlated pairs (>0.95):
hours_last_30d <-> log_hours_last_30d: 0.967
sessions_last_30d <-> sessions_last_90d: 0.952
The output is clean. No single feature has an AUC above 0.90, which would suggest leakage. The top feature (days_since_last_login, AUC 0.784) is strong but not suspiciously so --- it is simply a great feature. The correlated pairs make sense (the log transform is monotonically related to the raw feature; 30-day sessions is a subset of 90-day sessions). You may want to drop one from each pair to reduce redundancy, but tree-based models generally handle this gracefully.
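If you do decide to prune redundant features, a small helper (one simple policy among several; illustrative, not from the StreamFlow codebase) can drop the later column of every near-duplicate pair:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features, threshold=0.95):
    """Drop the later column of every numeric pair whose absolute
    correlation exceeds the threshold."""
    numeric = features.select_dtypes(include=[np.number])
    corr = numeric.corr().abs()
    # Upper triangle only, so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# Toy check: b is a near-copy of a, c is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.01, size=200),
                    'c': rng.normal(size=200)})
print(list(drop_highly_correlated(toy).columns))  # ['a', 'c']
```

For linear models this kind of pruning matters; for tree ensembles it mostly buys you a smaller, cheaper feature matrix.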
The Feature Engineering Mindset
Let us step back from the code and talk about the thinking behind it.
The features we built in this chapter were not discovered through random exploration. They were derived from a systematic process:
- We asked what a human expert would look at. A customer success manager would check: When did this person last log in? Are they using the product less? Have they contacted support? Are they new or established? Do they use the product on multiple devices? These questions became features.
- We encoded time at multiple scales. Seven-day, 30-day, and 90-day windows capture different aspects of behavior. A 7-day window is sensitive but noisy. A 90-day window is stable but slow to react. Giving the model multiple scales lets it learn which time horizon matters.
- We computed ratios, not just counts. Tickets per hour is more informative than raw ticket count. Hours per tenure month is more informative than total hours. Ratios normalize for exposure and make comparisons fair.
- We captured trends, not just snapshots. The direction of change is often more predictive than the absolute level. A subscriber declining from 30 hours/month to 10 hours/month is in trouble, even though 10 hours/month is still above average.
- We created interaction features that encode domain logic. "New and inactive" is not just tenure < 3 and hours = 0. It is the specific combination that customer success teams know is an emergency signal.
This is the craft of feature engineering. It is not automatable --- not yet, anyway. Feature stores can automate the computation. AutoML can automate the modeling. But the decision about which features to create still requires a human who understands both the data and the domain.
War Story --- A payments company spent six months building a fraud detection model with 200 features, mostly engineered by data scientists who had never seen a fraudulent transaction. The model achieved an AUC of 0.82. Then they invited two fraud analysts to a feature brainstorming session. In 90 minutes, the analysts suggested eight features: "transaction amount as a multiple of the user's average", "time since last successful transaction", "number of failed transactions in the last hour", "device fingerprint match rate", and four others. Those eight features, added to a simple gradient boosted tree, pushed the AUC to 0.94. The analysts did not know what an AUC was. They knew what fraud looked like.
Progressive Project Milestone M2: Engineer StreamFlow Features
Your task for this milestone:
- Start from the data you extracted in the Chapter 5 SQL milestone (M1).
- Engineer at least 15 features across these categories:
  - Tenure (at least 2 features): account age, tenure bucket, first-90-day flag
  - Recency (at least 2 features): days since last login, days since last action
  - Frequency (at least 3 features): sessions in multiple windows, hours in multiple windows
  - Trends (at least 2 features): usage change, declining/increasing flags
  - Ratios (at least 2 features): tickets per hour, hours per tenure month
  - Interactions (at least 2 features): domain-motivated combinations
  - Support (at least 1 feature): ticket count in a recent window
  - Engagement diversity (at least 1 feature): genre diversity score
- Run the validation checks: verify no feature has an AUC > 0.90 (leakage check), no constant features, and no unexpected missing values.
- Compare model performance: train a simple logistic regression and a gradient boosted tree on raw features vs. engineered features. Document the AUC improvement.
- Write a brief (1-page) feature dictionary describing each feature, its business logic, and how it is computed.
Try It --- The best features are the ones not listed above. Think about what StreamFlow-specific signals we have not captured. Does the user share their account? (Multiple simultaneous sessions.) Did they recently change their password? (Security concern or account hand-off.) Do they use the mobile app or only the web? (Mobile users may be more engaged.) Each insight is a candidate feature. The ones you think of yourself will teach you more than the ones in this chapter.
Summary
Feature engineering is where data science becomes a craft. It is the process of translating domain knowledge, business logic, and human intuition into computable variables that give your model the information it needs to make accurate predictions.
The key techniques --- temporal features (recency, frequency, tenure), mathematical transformations (log, Box-Cox, polynomial), interaction features, ratio features, and target encoding --- are tools. The real skill is knowing which tool to use and why.
We demonstrated that engineered features lift model performance far more than algorithm choice. A logistic regression with good features beat a gradient boosted tree with raw features. This is the lesson that separates practicing data scientists from tutorial followers: invest your time in features first, algorithms second.
We also established the critical discipline of preventing data leakage in feature engineering. Every feature must be computable at prediction time. Every statistic must be computed only on training data. Feature engineering inside a cross-validation pipeline is not optional --- it is the difference between a model that works in a notebook and a model that works in production.
In the next chapter, we will tackle categorical data --- one of the most common and most mishandled feature types in applied machine learning. The encoding decisions you make for categorical features will interact with everything you learned here.
Next: Chapter 7 --- Handling Categorical Data Previous: Chapter 5 --- SQL for Data Scientists