Chapter 6: Feature Engineering

The Art and Science of Creating Predictive Variables


Learning Objectives

By the end of this chapter, you will be able to:

  1. Generate features from raw data using domain knowledge
  2. Create temporal features (recency, frequency, tenure, trends)
  3. Apply mathematical transformations (log, polynomial, interaction terms)
  4. Extract features from dates, text, and structured data
  5. Use target encoding and other advanced techniques responsibly

War Story --- In 2021, a mid-size SaaS company was losing subscribers at an alarming rate. The data science team spent three weeks experimenting with neural network architectures --- LSTMs over usage sequences, attention mechanisms over clickstream data, autoencoders for representation learning. Their best model achieved an AUC of 0.78 on the holdout set. Then a new hire, fresh from a customer success role, asked a simple question: "Do we know when each user last logged in?" The team added a single feature --- days_since_last_login --- to a gradient boosted tree. The AUC jumped to 0.84. One feature, five minutes of work, and it outperformed three weeks of architecture search.

The lesson is not that neural networks are bad. The lesson is that feature engineering is where domain knowledge meets data, and domain knowledge almost always wins. The customer success manager knew that login recency was the single strongest signal of disengagement because she had watched hundreds of customers ghost their way to cancellation. She did not need an LSTM to discover what she already knew. She needed someone to ask her.

That war story should recalibrate your intuition about where to invest your time. In a typical data science project, feature engineering accounts for 60-80% of the predictive lift. Not the algorithm. Not the hyperparameters. The features. Andrew Ng has said it plainly: "Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering."

This chapter is about that craft. It is about the systematic process of translating domain knowledge, business logic, and raw intuition into computable variables that give your model the information it needs to make good predictions. We will work primarily with the StreamFlow churn dataset you extracted in Chapter 5, and we will build features that a subscription business expert would recognize as meaningful --- because that is the point.

Feature engineering is not a mechanical process. It is thinking, with code.


Why Features Matter More Than Algorithms

Before we write a single line of feature engineering code, let us establish why this chapter is arguably the most important one in the book.

Consider a simple experiment. We take the raw StreamFlow subscriber data --- subscriber_id, signup_date, plan_type, last_login_date, total_hours_watched, support_ticket_count, num_devices --- and we train three models:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load the raw StreamFlow subscriber data
df = pd.read_csv('streamflow_subscribers.csv', parse_dates=['signup_date', 'last_login_date'])

# Minimal features: just the raw numeric columns
raw_features = ['total_hours_watched', 'support_ticket_count', 'num_devices']
X_raw = df[raw_features].fillna(0)
y = df['churned_within_30_days']

X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42, stratify=y
)

# Three models, same raw features
models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('model', LogisticRegression(random_state=42))
    ]),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=200, random_state=42)
}

print("=== Raw Features (3 features) ===")
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print(f"{name:25s}  AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
=== Raw Features (3 features) ===
Logistic Regression        AUC: 0.641 (+/- 0.012)
Random Forest              AUC: 0.687 (+/- 0.015)
Gradient Boosting          AUC: 0.702 (+/- 0.014)

Now watch what happens when we engineer features:

# Engineered features from the same raw data
df['tenure_months'] = (
    (pd.Timestamp('2025-01-31') - df['signup_date']).dt.days / 30.44
).round(1)

df['days_since_last_login'] = (
    pd.Timestamp('2025-01-31') - df['last_login_date']
).dt.days

df['hours_per_tenure_month'] = (
    df['total_hours_watched'] / df['tenure_months'].clip(lower=1)
)

df['tickets_per_tenure_month'] = (
    df['support_ticket_count'] / df['tenure_months'].clip(lower=1)
)

df['is_multi_device'] = (df['num_devices'] > 1).astype(int)

engineered_features = [
    'tenure_months', 'days_since_last_login', 'total_hours_watched',
    'hours_per_tenure_month', 'support_ticket_count',
    'tickets_per_tenure_month', 'num_devices', 'is_multi_device'
]

X_eng = df[engineered_features].fillna(0)
# Same random_state and stratify as before, so the row split is identical
# and the y_train/y_test from the first split still align with these rows.
X_train_eng, X_test_eng, _, _ = train_test_split(
    X_eng, y, test_size=0.2, random_state=42, stratify=y
)

print("\n=== Engineered Features (8 features) ===")
for name, model in models.items():
    scores = cross_val_score(model, X_train_eng, y_train, cv=5, scoring='roc_auc')
    print(f"{name:25s}  AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
=== Engineered Features (8 features) ===
Logistic Regression        AUC: 0.761 (+/- 0.009)
Random Forest              AUC: 0.803 (+/- 0.011)
Gradient Boosting          AUC: 0.819 (+/- 0.010)

The logistic regression with engineered features (0.761) outperforms the gradient boosting with raw features (0.702). The simple model with good features beats the complex model with bad features. This is the single most important lesson in applied machine learning, and it is the reason this chapter exists.

Production Tip --- When your model is not performing well, your first instinct should be to engineer better features, not to try a fancier algorithm. Algorithm shopping is the procrastination of data science. Feature engineering is the work.


The Feature Engineering Recipe

Feature engineering is not random exploration. It follows a systematic process:

  1. Understand the domain. Talk to subject matter experts. Read the business wiki. Shadow a customer success manager. The best features come from people who understand the problem, not from people who understand gradient boosting.

  2. Ask: "What would a human expert look at?" If you were a StreamFlow customer success manager trying to predict which subscribers would cancel, what would you check? Their recent usage? How long they have been a customer? Whether they have contacted support? Whether their usage is trending up or down? Every answer to this question is a candidate feature.

  3. Translate intuition into computable variables. "Usage is declining" becomes avg_hours_last_30d - avg_hours_prior_30d. "They seem disengaged" becomes days_since_last_login. "They have had a bad experience" becomes support_tickets_last_90d.

  4. Validate features against the target. Compute correlation, mutual information, or simple AUC of each feature individually. Features that have no univariate relationship with the target are unlikely to help (though interaction effects can be exceptions).

  5. Respect the train/test boundary. Every feature computation must be reproducible at prediction time using only information available at that moment. We will return to this critical constraint at the end of the chapter.

Let us work through each category of feature engineering technique.


Temporal Features: Recency, Frequency, and Tenure

Temporal features are the workhorses of subscription and behavioral models. They capture when things happened, how often, and for how long. If you remember nothing else from this chapter, remember RFT: Recency, Frequency, Tenure.

Recency Features

Recency measures how recently a user took an action. In subscription businesses, recency is often the single strongest predictor of churn. A user who logged in yesterday is dramatically less likely to cancel than one who has not logged in for three weeks.

import pandas as pd
import numpy as np

# Assume prediction_date is the date we make predictions
prediction_date = pd.Timestamp('2025-01-31')

def create_recency_features(df, prediction_date):
    """Create recency features from event timestamps."""

    features = pd.DataFrame(index=df.index)

    # Days since last login
    features['days_since_last_login'] = (
        prediction_date - df['last_login_date']
    ).dt.days

    # Days since last support ticket
    features['days_since_last_ticket'] = (
        prediction_date - df['last_ticket_date']
    ).dt.days

    # Days since last plan change (upgrade or downgrade)
    features['days_since_plan_change'] = (
        prediction_date - df['last_plan_change_date']
    ).dt.days

    # Days since account creation (this is tenure, but it is also a recency metric)
    features['account_age_days'] = (
        prediction_date - df['signup_date']
    ).dt.days

    # Binary: logged in within last 7 days?
    features['active_last_7d'] = (
        features['days_since_last_login'] <= 7
    ).astype(int)

    # Binary: logged in within last 30 days?
    features['active_last_30d'] = (
        features['days_since_last_login'] <= 30
    ).astype(int)

    return features

recency = create_recency_features(df, prediction_date)
print(recency.describe().round(1))
       days_since_last_login  days_since_last_ticket  days_since_plan_change  account_age_days  active_last_7d  active_last_30d
count              2400000.0               2400000.0               2400000.0         2400000.0       2400000.0        2400000.0
mean                    12.4                    87.3                   241.6             438.2             0.7              0.9
std                     18.7                   112.4                   198.3             312.5             0.5              0.3
min                      0.0                     0.0                     0.0               1.0             0.0              0.0
25%                      2.0                    14.0                    89.0             187.0             0.0              1.0
50%                      5.0                    48.0                   184.0             364.0             1.0              1.0
75%                     14.0                   118.0                   342.0             621.0             1.0              1.0
max                    365.0                  1095.0                  1825.0            2190.0             1.0              1.0

Notice the distribution of days_since_last_login. The median is 5 days, but the 75th percentile is 14 days and the max is 365. That long tail contains your high-risk churners. The binary threshold features (active_last_7d, active_last_30d) give your model clean cutpoints to work with.

Frequency Features

Frequency measures how often a user takes an action within a defined window. The window matters. A user who watched 100 hours in their first month but zero in the last month is very different from a user who watches 10 hours every month.

def create_frequency_features(df, events_df, prediction_date):
    """
    Create frequency features from an events table.

    events_df has columns: subscriber_id, event_date, event_type,
                           hours_watched, genre
    """
    features = pd.DataFrame(index=df.index)

    # Filter events before prediction date (no future leakage)
    events = events_df[events_df['event_date'] < prediction_date].copy()

    # --- Usage frequency by window ---
    for window_days, label in [(7, '7d'), (30, '30d'), (90, '90d')]:
        cutoff = prediction_date - pd.Timedelta(days=window_days)
        window_events = events[events['event_date'] >= cutoff]

        usage = window_events.groupby('subscriber_id').agg(
            sessions=('event_date', 'nunique'),
            total_hours=('hours_watched', 'sum'),
            unique_genres=('genre', 'nunique')
        ).rename(columns={
            'sessions': f'sessions_last_{label}',
            'total_hours': f'hours_last_{label}',
            'unique_genres': f'genres_last_{label}'
        })

        features = features.join(usage, how='left')

    # --- Support ticket frequency ---
    for window_days, label in [(30, '30d'), (90, '90d')]:
        cutoff = prediction_date - pd.Timedelta(days=window_days)
        tickets = events[
            (events['event_type'] == 'support_ticket') &
            (events['event_date'] >= cutoff)
        ]
        ticket_counts = tickets.groupby('subscriber_id').size()
        features[f'support_tickets_last_{label}'] = ticket_counts

    # Fill NaN with 0 (no events = zero frequency)
    features = features.fillna(0)

    return features

freq = create_frequency_features(df, events_df, prediction_date)
print(freq[['sessions_last_7d', 'sessions_last_30d', 'hours_last_30d',
            'genres_last_30d', 'support_tickets_last_90d']].describe().round(1))
       sessions_last_7d  sessions_last_30d  hours_last_30d  genres_last_30d  support_tickets_last_90d
count         2400000.0          2400000.0       2400000.0        2400000.0                 2400000.0
mean                3.2               11.4            18.7              3.1                       0.8
std                 2.4                7.8            14.2              2.0                       1.4
min                 0.0                0.0             0.0              0.0                       0.0
25%                 1.0                5.0             8.2              2.0                       0.0
50%                 3.0               10.0            15.6              3.0                       0.0
75%                 5.0               16.0            26.3              4.0                       1.0
max                 7.0               30.0           120.4              8.0                      23.0

Common Mistake --- Computing frequency features over the entire history of a user without windowing. A user who watched 500 hours over 3 years but zero in the last month is not an active user. Always window your frequency features, and always create multiple windows (7-day, 30-day, 90-day) so the model can learn which time horizon matters most.

Tenure Features

Tenure is how long a customer has been a customer. In subscription businesses, churn risk is highest in the first 90 days (the "danger zone") and lowest for subscribers who have survived their first year.

def create_tenure_features(df, prediction_date):
    """Create tenure features from signup date."""
    features = pd.DataFrame(index=df.index)

    # Raw tenure
    features['tenure_days'] = (prediction_date - df['signup_date']).dt.days
    features['tenure_months'] = (features['tenure_days'] / 30.44).round(1)
    features['tenure_years'] = (features['tenure_days'] / 365.25).round(2)

    # Tenure buckets (the "danger zone" pattern)
    features['tenure_bucket'] = pd.cut(
        features['tenure_months'],
        bins=[0, 1, 3, 6, 12, 24, float('inf')],
        labels=['0-1mo', '1-3mo', '3-6mo', '6-12mo', '12-24mo', '24mo+'],
        right=True,
        include_lowest=True  # keep brand-new subscribers (tenure 0.0) in the first bucket
    )

    # Binary: is the subscriber in the high-risk first 90 days?
    features['is_first_90_days'] = (features['tenure_days'] <= 90).astype(int)

    return features

tenure = create_tenure_features(df, prediction_date)
print(tenure['tenure_bucket'].value_counts().sort_index())
0-1mo       98400
1-3mo      192000
3-6mo      312000
6-12mo     504000
12-24mo    648000
24mo+      645600
Name: tenure_bucket, dtype: int64

RFM Features for Subscription Data

RFM --- Recency, Frequency, Monetary --- is a framework from direct marketing that translates perfectly to subscription analytics. For StreamFlow, we adapt "Monetary" to "engagement value" since all subscribers in a plan tier pay the same price:

def create_rfm_features(df, events_df, prediction_date):
    """
    Create RFM (Recency, Frequency, Monetary/Engagement) features.

    For subscription businesses, 'Monetary' is adapted to engagement
    metrics since revenue per user is tied to plan tier.
    """
    features = pd.DataFrame(index=df.index)

    # --- Recency (lower = more recent = better) ---
    features['rfm_recency'] = (
        prediction_date - df['last_login_date']
    ).dt.days

    # --- Frequency (last 90 days) ---
    cutoff_90d = prediction_date - pd.Timedelta(days=90)
    recent_events = events_df[
        (events_df['event_date'] >= cutoff_90d) &
        (events_df['event_date'] < prediction_date)  # no future leakage
    ]
    session_counts = recent_events.groupby('subscriber_id')['event_date'].nunique()
    features['rfm_frequency'] = session_counts.reindex(df.index, fill_value=0)

    # --- Monetary (avg hours per session, last 90 days) ---
    engagement = recent_events.groupby('subscriber_id').agg(
        total_hours=('hours_watched', 'sum'),
        total_sessions=('event_date', 'nunique')
    )
    engagement['avg_hours_per_session'] = (
        engagement['total_hours'] / engagement['total_sessions'].clip(lower=1)
    )
    features['rfm_monetary'] = engagement['avg_hours_per_session'].reindex(
        df.index, fill_value=0
    )

    # --- RFM Score (quintile-based) ---
    # Recency: lower is better, so reverse the scoring.
    # rank(method='first') breaks ties so qcut can always form five bins.
    features['rfm_recency_score'] = pd.qcut(
        features['rfm_recency'].rank(method='first'), q=5, labels=[5, 4, 3, 2, 1]
    ).astype(int)

    features['rfm_frequency_score'] = pd.qcut(
        features['rfm_frequency'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5]
    ).astype(int)

    features['rfm_monetary_score'] = pd.qcut(
        features['rfm_monetary'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5]
    ).astype(int)

    # Combined RFM score (simple sum)
    features['rfm_total_score'] = (
        features['rfm_recency_score'] +
        features['rfm_frequency_score'] +
        features['rfm_monetary_score']
    )

    return features

rfm = create_rfm_features(df, events_df, prediction_date)
print(rfm[['rfm_recency', 'rfm_frequency', 'rfm_monetary', 'rfm_total_score']].describe().round(2))
       rfm_recency  rfm_frequency  rfm_monetary  rfm_total_score
count   2400000.00     2400000.00    2400000.00       2400000.00
mean         12.40          24.80          1.64           9.00
std          18.70          18.20          0.89           3.46
min           0.00           0.00          0.00           3.00
25%           2.00          10.00          1.02           7.00
50%           5.00          22.00          1.58           9.00
75%          14.00          36.00          2.18          11.00
max         365.00          90.00          8.42          15.00

Production Tip --- In production systems, pre-compute RFM features nightly in SQL (Chapter 5) and store them in a feature store. Your model training pipeline reads from the store; your real-time prediction service reads from the same store. This guarantees that the features used in training are identical to the features used in serving.
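As a sketch of that train/serve symmetry in miniature, the snippet below uses a CSV file as a stand-in for a real feature store (the path and function names are illustrative; in practice this would be a warehouse table or a dedicated feature store). The point is that both pipelines read the same artifact:

```python
import pandas as pd

FEATURE_STORE_PATH = 'features_rfm_snapshot.csv'  # stand-in for a feature store table

def nightly_feature_job(rfm: pd.DataFrame) -> None:
    """Persist the nightly feature snapshot that both pipelines read."""
    rfm.to_csv(FEATURE_STORE_PATH)

def load_training_features() -> pd.DataFrame:
    """Training pipeline reads the stored snapshot."""
    return pd.read_csv(FEATURE_STORE_PATH, index_col='subscriber_id')

def load_serving_features(subscriber_ids) -> pd.DataFrame:
    """Serving reads the same artifact, so train and serve cannot diverge."""
    return load_training_features().loc[subscriber_ids]

# Tiny demo snapshot
demo = pd.DataFrame(
    {'rfm_recency': [2, 30], 'rfm_frequency': [20, 3]},
    index=pd.Index([101, 102], name='subscriber_id')
)
nightly_feature_job(demo)
print(load_serving_features([101]))
```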


Trend Features: Capturing Behavioral Change

Static features tell you what a user looks like right now. Trend features tell you whether they are getting better or worse. A subscriber who watched 20 hours last month is very different depending on whether they watched 25 hours the month before (declining) or 15 hours (increasing).

def create_trend_features(df, events_df, prediction_date):
    """
    Create trend features by comparing recent behavior to prior behavior.

    The key insight: the direction of change is often more predictive
    than the absolute level.
    """
    features = pd.DataFrame(index=df.index)

    # Define two periods
    recent_start = prediction_date - pd.Timedelta(days=30)
    prior_start = prediction_date - pd.Timedelta(days=60)

    recent_events = events_df[
        (events_df['event_date'] >= recent_start) &
        (events_df['event_date'] < prediction_date)
    ]
    prior_events = events_df[
        (events_df['event_date'] >= prior_start) &
        (events_df['event_date'] < recent_start)
    ]

    # Hours watched in each period
    recent_hours = recent_events.groupby('subscriber_id')['hours_watched'].sum()
    prior_hours = prior_events.groupby('subscriber_id')['hours_watched'].sum()

    features['hours_last_30d'] = recent_hours.reindex(df.index, fill_value=0)
    features['hours_prior_30d'] = prior_hours.reindex(df.index, fill_value=0)

    # Absolute change
    features['hours_change_30d'] = (
        features['hours_last_30d'] - features['hours_prior_30d']
    )

    # Percentage change (with safe division)
    features['hours_pct_change_30d'] = np.where(
        features['hours_prior_30d'] > 0,
        (features['hours_change_30d'] / features['hours_prior_30d']) * 100,
        np.where(features['hours_last_30d'] > 0, 100.0, 0.0)
    )
    # Clip extreme percentage changes
    features['hours_pct_change_30d'] = features['hours_pct_change_30d'].clip(-100, 500)

    # Session frequency trend
    recent_sessions = recent_events.groupby('subscriber_id')['event_date'].nunique()
    prior_sessions = prior_events.groupby('subscriber_id')['event_date'].nunique()

    features['sessions_last_30d'] = recent_sessions.reindex(df.index, fill_value=0)
    features['sessions_prior_30d'] = prior_sessions.reindex(df.index, fill_value=0)
    features['sessions_change_30d'] = (
        features['sessions_last_30d'] - features['sessions_prior_30d']
    )

    # Binary trend indicator
    features['usage_declining'] = (features['hours_change_30d'] < -2).astype(int)
    features['usage_increasing'] = (features['hours_change_30d'] > 2).astype(int)

    return features

trends = create_trend_features(df, events_df, prediction_date)
print(trends[['hours_change_30d', 'hours_pct_change_30d',
              'sessions_change_30d', 'usage_declining']].describe().round(2))
       hours_change_30d  hours_pct_change_30d  sessions_change_30d  usage_declining
count        2400000.00            2400000.00           2400000.00       2400000.00
mean              -1.24                 -4.68                -0.42             0.31
std                9.87                 42.34                 4.18             0.46
min              -78.40               -100.00               -22.00             0.00
25%               -5.20                -28.57                -2.00             0.00
50%               -0.40                 -2.14                 0.00             0.00
75%                3.10                 18.75                 1.00             1.00
max               92.60                500.00                24.00             1.00

Notice that the mean hours_change_30d is -1.24 --- on average, usage is slightly declining across the subscriber base. This makes sense for a service with 8.2% monthly churn. The usage_declining binary flag captures the 31% of subscribers whose usage dropped by more than 2 hours. This is a clean, interpretable signal.

Try It --- Create a three-period trend feature: compare the last 30 days to the prior 30 days and to the 30 days before that. A subscriber declining for two consecutive periods is a much stronger churn signal than one declining for a single period. You can encode this as a categorical: "accelerating decline", "recent decline", "stable", "recent growth", "accelerating growth".
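If you want a starting point for that exercise, one possible shape for the categorical looks like the sketch below. The 2-hour threshold mirrors the usage_declining flag above and is an assumption, not a prescription:

```python
def label_trend(hours_p2, hours_p1, hours_p0, threshold=2.0):
    """Classify a subscriber's usage trajectory across three 30-day periods.

    hours_p2: 60-90 days ago, hours_p1: 30-60 days ago, hours_p0: last 30 days.
    A change smaller than `threshold` hours counts as stable.
    """
    recent = hours_p0 - hours_p1   # change in the most recent period
    earlier = hours_p1 - hours_p2  # change in the period before that
    if recent < -threshold and earlier < -threshold:
        return 'accelerating decline'
    if recent < -threshold:
        return 'recent decline'
    if recent > threshold and earlier > threshold:
        return 'accelerating growth'
    if recent > threshold:
        return 'recent growth'
    return 'stable'

print(label_trend(25, 18, 10))  # declining in both periods: 'accelerating decline'
print(label_trend(10, 10, 20))  # 'recent growth'
print(label_trend(12, 11, 12))  # 'stable'
```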


Mathematical Transformations

Raw features often have distributions that make it hard for models to learn effectively. Mathematical transformations reshape these distributions to improve model performance.

Log Transformation

The log transformation compresses the right tail of skewed distributions. It is the most common transformation in data science, and for good reason: many real-world variables (revenue, page views, session duration) follow approximately log-normal distributions.

import matplotlib.pyplot as plt

# Total hours watched is right-skewed
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(df['total_hours_watched'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_title('Total Hours Watched (Raw)')
axes[0].set_xlabel('Hours')

# Log transform (add 1 to handle zeros)
df['log_total_hours'] = np.log1p(df['total_hours_watched'])
axes[1].hist(df['log_total_hours'], bins=50, edgecolor='black', alpha=0.7)
axes[1].set_title('Total Hours Watched (Log-Transformed)')
axes[1].set_xlabel('log(1 + Hours)')

plt.tight_layout()
plt.savefig('log_transform_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

When should you use a log transformation?

  • Right-skewed distributions with a long tail (revenue, counts, durations)
  • Multiplicative relationships that you want to make additive
  • Features with orders-of-magnitude variation (a user with 10 hours and a user with 10,000 hours)

When should you not use it?

  • Features with zeros or negative values (use np.log1p() for zeros; avoid log for negative values)
  • Already-normal distributions (the transformation will skew them)
  • Tree-based models (trees split on rank order, so monotonic transformations do not change the splits)

# Log transformation is most helpful for linear models
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Compare raw vs. log-transformed for logistic regression
features_raw = df[['total_hours_watched', 'support_ticket_count', 'tenure_days']].fillna(0)
features_log = features_raw.copy()
features_log['total_hours_watched'] = np.log1p(features_log['total_hours_watched'])
features_log['support_ticket_count'] = np.log1p(features_log['support_ticket_count'])

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    features_raw, y, test_size=0.2, random_state=42, stratify=y
)
X_train_log, X_test_log, _, _ = train_test_split(
    features_log, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
lr = LogisticRegression(random_state=42)

# Raw
lr.fit(scaler.fit_transform(X_train_raw), y_train)
auc_raw = roc_auc_score(y_test, lr.predict_proba(scaler.transform(X_test_raw))[:, 1])

# Log-transformed
lr.fit(scaler.fit_transform(X_train_log), y_train)
auc_log = roc_auc_score(y_test, lr.predict_proba(scaler.transform(X_test_log))[:, 1])

print(f"Logistic Regression AUC (raw):             {auc_raw:.4f}")
print(f"Logistic Regression AUC (log-transformed): {auc_log:.4f}")
Logistic Regression AUC (raw):             0.6812
Logistic Regression AUC (log-transformed): 0.7134

Box-Cox Transformation

The Box-Cox transformation is a generalization of the log transform. It finds the optimal power transformation to make a distribution as close to normal as possible:

from scipy import stats

# Box-Cox requires strictly positive values
hours_positive = df['total_hours_watched'][df['total_hours_watched'] > 0]

# Find optimal lambda
transformed, fitted_lambda = stats.boxcox(hours_positive)
print(f"Optimal lambda: {fitted_lambda:.4f}")
# lambda = 0 corresponds to log; lambda = 1 means no transformation
Optimal lambda: 0.1247

A lambda near 0 confirms that the log transform is close to optimal for this feature. In practice, np.log1p() is usually sufficient, and Box-Cox is reserved for cases where you need the extra precision.

from sklearn.preprocessing import PowerTransformer

# PowerTransformer handles the Box-Cox transformation in a sklearn pipeline
# Yeo-Johnson variant handles zeros and negative values
pt = PowerTransformer(method='yeo-johnson', standardize=True)

features_to_transform = ['total_hours_watched', 'support_ticket_count', 'tenure_days']
X_transformed = pt.fit_transform(df[features_to_transform].fillna(0))
print(f"Lambdas: {dict(zip(features_to_transform, pt.lambdas_.round(4)))}")
Lambdas: {'total_hours_watched': 0.1384, 'support_ticket_count': 0.3012, 'tenure_days': 0.7821}

Common Mistake --- Fitting the PowerTransformer on the entire dataset including the test set. The optimal lambda must be computed only on the training data, then applied to the test data using transform(). Fitting on the full dataset leaks information from the test set into your feature engineering.
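The correct pattern looks like this, shown on synthetic right-skewed data: estimate the lambda from the training rows only, then apply that fitted lambda to the test rows with transform():

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature for illustration
rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(1000, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

pt = PowerTransformer(method='yeo-johnson', standardize=True)

# Correct: lambda estimated from the training rows only
X_train_t = pt.fit_transform(X_train)
X_test_t = pt.transform(X_test)   # reuse the training lambda; no refit

print(f"Lambda fitted on train only: {pt.lambdas_[0]:.4f}")
```

In a Pipeline, this happens automatically: cross_val_score refits the transformer inside each fold, which is one more reason to keep transformations inside the pipeline rather than applying them to the full DataFrame up front.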

Polynomial Features

Polynomial features create non-linear terms from your existing features. They are most useful for linear models that cannot learn non-linear relationships on their own.

from sklearn.preprocessing import PolynomialFeatures

# Create quadratic features (degree=2)
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Assumes the engineered columns from earlier sections have been joined back onto df
base_features = df[['tenure_months', 'hours_last_30d', 'support_tickets_last_90d']].fillna(0)
poly_features = poly.fit_transform(base_features)

print(f"Input features: {base_features.shape[1]}")
print(f"Output features: {poly_features.shape[1]}")
print(f"Feature names: {poly.get_feature_names_out()}")
Input features: 3
Output features: 9
Feature names: ['tenure_months' 'hours_last_30d' 'support_tickets_last_90d'
 'tenure_months^2' 'tenure_months hours_last_30d'
 'tenure_months support_tickets_last_90d' 'hours_last_30d^2'
 'hours_last_30d support_tickets_last_90d' 'support_tickets_last_90d^2']

Three features become nine. With degree=3, they would become 19. With 10 input features and degree=2, you get 65 outputs. Polynomial features scale combinatorially, which means you need to be selective.

Production Tip --- Use interaction_only=True if you want interaction terms (tenure * hours) without squared terms (tenure^2). This is often the better choice because interaction terms capture "these two things together" while squared terms capture "more of this thing has a non-linear effect."

Interaction Features

Interaction features capture the combined effect of two variables. They encode the insight that the effect of one variable depends on the value of another.

def create_interaction_features(df):
    """
    Create interaction features that encode domain knowledge.

    Each interaction should have a business interpretation.
    Do not blindly generate all pairwise interactions.
    """
    features = pd.DataFrame(index=df.index)

    # Tenure * usage trend: new users with declining usage are highest risk
    features['tenure_x_usage_decline'] = (
        df['tenure_months'] * df['hours_change_30d']
    )

    # Support tickets per hour watched: high ratio = frustrated user
    features['tickets_per_hour'] = np.where(
        df['hours_last_30d'] > 0,
        df['support_tickets_last_90d'] / df['hours_last_30d'],
        df['support_tickets_last_90d']  # if no hours, just use ticket count
    )

    # Device count * genre diversity: multi-device, multi-genre = engaged household
    features['device_x_genre_diversity'] = (
        df['num_devices'] * df['genres_last_30d']
    )

    # Is the user a new subscriber (< 3 months) with zero recent usage?
    features['new_and_inactive'] = (
        (df['tenure_months'] < 3) & (df['hours_last_30d'] == 0)
    ).astype(int)

    return features

interactions = create_interaction_features(df)
print(interactions.describe().round(3))
       tenure_x_usage_decline  tickets_per_hour  device_x_genre_diversity  new_and_inactive
count             2400000.000       2400000.000               2400000.000       2400000.000
mean                  -18.446             0.062                     5.834             0.024
std                   189.723             0.187                     5.192             0.153
min                 -4896.000             0.000                     0.000             0.000
25%                   -48.360             0.000                     2.000             0.000
50%                    -3.640             0.000                     4.000             0.000
75%                    26.460             0.047                     8.000             0.000
max                  7200.000            12.000                    48.000             1.000

The key rule for interaction features: every interaction should have a business interpretation. If you cannot explain why two features interact to someone who understands the business, the interaction is probably noise. "Support tickets per hour watched" makes sense --- it captures frustration intensity relative to engagement. "Tenure times device count" is less clear --- what does it mean for a 12-month subscriber to have 3 devices? That interaction should earn its place through empirical validation, not just combinatorial generation.


Date Feature Extraction

Dates contain far more information than a single timestamp. A subscription that started on January 2nd behaves differently from one that started on December 26th, and a user who signed up on a Monday is statistically different from one who signed up on a Saturday.

def create_date_features(df, date_col='signup_date'):
    """Extract features from a datetime column."""
    features = pd.DataFrame(index=df.index)
    dt = df[date_col]

    # Calendar features
    features[f'{date_col}_month'] = dt.dt.month
    features[f'{date_col}_day_of_week'] = dt.dt.dayofweek  # 0=Mon, 6=Sun
    features[f'{date_col}_day_of_month'] = dt.dt.day
    features[f'{date_col}_quarter'] = dt.dt.quarter
    features[f'{date_col}_is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)

    # Cyclical encoding for month (so December and January are close)
    features[f'{date_col}_month_sin'] = np.sin(2 * np.pi * dt.dt.month / 12)
    features[f'{date_col}_month_cos'] = np.cos(2 * np.pi * dt.dt.month / 12)

    # Cyclical encoding for day of week
    features[f'{date_col}_dow_sin'] = np.sin(2 * np.pi * dt.dt.dayofweek / 7)
    features[f'{date_col}_dow_cos'] = np.cos(2 * np.pi * dt.dt.dayofweek / 7)

    # Business-specific: did they sign up during a promotion period?
    promo_periods = [
        (pd.Timestamp('2024-11-25'), pd.Timestamp('2024-12-02')),  # Black Friday
        (pd.Timestamp('2024-01-01'), pd.Timestamp('2024-01-07')),  # New Year
    ]
    features[f'{date_col}_during_promo'] = 0
    for start, end in promo_periods:
        features.loc[
            (dt >= start) & (dt <= end), f'{date_col}_during_promo'
        ] = 1

    return features

date_features = create_date_features(df, 'signup_date')
print(date_features.head(3).T)
                              0         1         2
signup_date_month           3.0       7.0      11.0
signup_date_day_of_week     2.0       4.0       1.0
signup_date_day_of_month   15.0      22.0       8.0
signup_date_quarter         1.0       3.0       4.0
signup_date_is_weekend      0.0       0.0       0.0
signup_date_month_sin       1.0      -0.5      -0.5
signup_date_month_cos       0.0      -0.9       0.9
signup_date_dow_sin         1.0      -0.4       0.8
signup_date_dow_cos        -0.2      -0.9       0.6
signup_date_during_promo    0.0       0.0       0.0

Common Mistake --- Using month as a raw integer (1-12). This tells the model that December (12) is 12 times January (1) and that December is far from January, when cyclically they are adjacent. Cyclical encoding (sine/cosine) preserves the circular relationship. This matters most for linear models. Tree-based models can handle raw integers because they split on thresholds, not on magnitude.
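A quick numeric check of that claim: under the sine/cosine encoding, December and January sit next to each other on the circle, while the raw integers place them as far apart as possible.

```python
import numpy as np

def month_xy(m):
    """Map month 1-12 onto the unit circle."""
    ang = 2 * np.pi * m / 12
    return np.array([np.sin(ang), np.cos(ang)])

dist = lambda m1, m2: float(np.linalg.norm(month_xy(m1) - month_xy(m2)))

print(round(dist(12, 1), 3))   # 0.518 -- adjacent months are close
print(round(dist(12, 6), 3))   # 2.0   -- June is maximally far from December
print(abs(12 - 1))             # 11    -- raw integers call Dec/Jan maximally far
```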


Ratio and Normalized Features

Raw counts are often less informative than ratios. "This user filed 5 support tickets" is ambiguous. Is that a lot? It depends on how long they have been a customer and how much they use the product.

def create_ratio_features(df):
    """
    Create ratio features that normalize raw counts by exposure.

    Ratios control for the 'opportunity to have the event' and
    make comparisons fair across users with different tenures
    and usage levels.
    """
    features = pd.DataFrame(index=df.index)

    # Usage intensity: hours per month of tenure
    features['hours_per_tenure_month'] = np.where(
        df['tenure_months'] > 0,
        df['total_hours_watched'] / df['tenure_months'],
        0
    )

    # Support burden: tickets per month of tenure
    features['tickets_per_tenure_month'] = np.where(
        df['tenure_months'] > 0,
        df['support_ticket_count'] / df['tenure_months'],
        0
    )

    # Engagement depth: hours per session (recent)
    features['hours_per_session_30d'] = np.where(
        df['sessions_last_30d'] > 0,
        df['hours_last_30d'] / df['sessions_last_30d'],
        0
    )

    # Genre exploration rate: unique genres / total sessions
    features['genre_diversity_score'] = np.where(
        df['sessions_last_90d'] > 0,
        df['genres_last_90d'] / df['sessions_last_90d'],
        0
    )

    # Device utilization: devices used last 30 days / total registered devices
    features['device_utilization'] = np.where(
        df['num_devices'] > 0,
        df['active_devices_last_30d'] / df['num_devices'],
        0
    )

    # Weekend usage ratio: what fraction of usage happens on weekends?
    features['weekend_usage_ratio'] = np.where(
        df['hours_last_30d'] > 0,
        df['weekend_hours_last_30d'] / df['hours_last_30d'],
        0
    )

    return features

ratios = create_ratio_features(df)
print(ratios.describe().round(3))
       hours_per_tenure_month  tickets_per_tenure_month  hours_per_session_30d  genre_diversity_score  device_utilization  weekend_usage_ratio
count             2400000.000               2400000.000            2400000.000            2400000.000         2400000.000          2400000.000
mean                    1.842                     0.067                  1.621                  0.134               0.724                0.312
std                     1.406                     0.098                  0.894                  0.087               0.289                0.148
min                     0.000                     0.000                  0.000                  0.000               0.000                0.000
25%                     0.912                     0.000                  1.024                  0.071               0.500                0.214
50%                     1.583                     0.034                  1.487                  0.125               0.750                0.298
75%                     2.404                     0.091                  2.048                  0.182               1.000                0.394
max                    24.680                     2.333                 12.040                  1.000               1.000                1.000

A note on genre_diversity_score: this feature captures how broadly a subscriber explores the content library. A subscriber who watches only one genre is more vulnerable to a single bad content decision. A subscriber who explores widely has more reasons to stay. This kind of feature --- which requires domain intuition to create --- is exactly what separates data science from data engineering.


Text Feature Extraction (Preview)

StreamFlow subscribers write support tickets and leave plan-cancellation reasons in free text. Even simple text features can be powerful:

def create_text_features(df, text_col='cancellation_reason'):
    """
    Extract simple features from text fields.

    Full NLP feature engineering is covered in Chapter 26.
    Here we extract basic signals.
    """
    features = pd.DataFrame(index=df.index)

    text = df[text_col].fillna('')

    # Basic: does the field have any text?
    features[f'{text_col}_has_text'] = (text.str.len() > 0).astype(int)

    # Length features
    features[f'{text_col}_char_count'] = text.str.len()
    features[f'{text_col}_word_count'] = text.str.split().str.len().fillna(0).astype(int)

    # Keyword flags
    negative_keywords = ['expensive', 'cost', 'price', 'waste', 'terrible', 'awful', 'hate']
    competitor_keywords = ['netflix', 'spotify', 'youtube', 'switched', 'alternative']

    features[f'{text_col}_mentions_negative'] = (
        text.str.lower().str.contains('|'.join(negative_keywords), na=False)
    ).astype(int)

    features[f'{text_col}_mentions_competitor'] = (
        text.str.lower().str.contains('|'.join(competitor_keywords), na=False)
    ).astype(int)

    return features

# Only applicable to subscribers who have initiated cancellation
# or left feedback --- not available for current active subscribers
# Use with caution: this feature may only be available AFTER churn intent is visible

Common Mistake --- Using cancellation reason text as a feature to predict churn. The cancellation reason is only written when the user is already canceling --- it is target leakage by definition. The text features above would only be useful for understanding why users churn (descriptive), not for predicting who will churn (predictive). Be ruthless about the timeline: would this data be available at the moment you need to make the prediction?


Binning: Converting Continuous Features to Categories

Binning (also called discretization) converts continuous features into categorical buckets. It can help when the relationship between a feature and the target is non-linear and step-like rather than smooth.

def create_binned_features(df):
    """
    Bin continuous features into categories.

    Use domain-informed bins when you know the breakpoints.
    Use quantile bins when you want equal-sized groups.
    Use equal-width bins when absolute magnitudes matter.
    """
    features = pd.DataFrame(index=df.index)

    # Domain-informed bins: tenure buckets based on known churn patterns
    features['tenure_risk_group'] = pd.cut(
        df['tenure_months'],
        bins=[0, 1, 3, 6, 12, 24, float('inf')],
        labels=['very_high', 'high', 'elevated', 'moderate', 'low', 'very_low']
    )

    # Quantile bins: usage quartiles (equal-frequency)
    features['usage_quartile'] = pd.qcut(
        df['hours_last_30d'], q=4, labels=['Q1_low', 'Q2', 'Q3', 'Q4_high']
    )

    # Custom count bins: support tickets (-1 lower edge so a count of 0 lands in 'none')
    features['ticket_level'] = pd.cut(
        df['support_tickets_last_90d'],
        bins=[-1, 0, 1, 3, float('inf')],
        labels=['none', 'low', 'moderate', 'high']
    )

    return features

When to bin:

  • When the relationship is clearly non-linear with plateau effects (churn risk drops sharply after 3 months of tenure, then levels off)
  • When you need interpretable segments for business stakeholders ("high-risk tenure group" is more actionable than "tenure coefficient = -0.028")
  • When working with linear models that cannot capture non-linearity otherwise

When not to bin:

  • With tree-based models (they find optimal split points automatically)
  • When the relationship is smoothly linear or monotonic
  • When you are discarding useful variation within each bin

Production Tip --- If you bin features, store the bin edges as configuration, not as code. When your data distribution shifts (and it will), you need to decide whether to keep the original bins or re-compute them. In most production systems, you keep the original bins and monitor for distributional drift.
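A minimal sketch of that pattern, reusing the tenure bins from above (in practice the config dict would live in JSON or YAML next to the model artifact):

```python
import pandas as pd

# Bin edges as configuration, not code
BIN_CONFIG = {
    'tenure_months': {
        'edges': [0, 1, 3, 6, 12, 24, float('inf')],
        'labels': ['very_high', 'high', 'elevated', 'moderate', 'low', 'very_low'],
    }
}

def apply_frozen_bins(series, cfg):
    """Apply stored bin edges so training and serving bin identically."""
    return pd.cut(series, bins=cfg['edges'], labels=cfg['labels'])

tenure = pd.Series([0.5, 2.0, 8.0, 40.0])
print(apply_frozen_bins(tenure, BIN_CONFIG['tenure_months']).tolist())
# ['very_high', 'high', 'moderate', 'very_low']
```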


Target Encoding (Advanced)

Target encoding replaces a categorical value with the mean of the target variable for that category. It is powerful for high-cardinality categoricals but dangerous if done incorrectly.

# Naive target encoding (DO NOT USE IN PRODUCTION)
plan_means_WRONG = df.groupby('plan_type')['churned_within_30_days'].mean()
df['plan_target_enc_WRONG'] = df['plan_type'].map(plan_means_WRONG)
# This leaks the target into the features!

The problem: if you compute the target mean on the entire dataset, the encoded feature contains information from the test set. Worse, for each individual row, the target mean includes that row's own target value. This is leakage.
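A toy illustration of the per-row leakage (a leave-one-out variant is shown for contrast; it removes each row's own target from the mean, though the cross-validated approach below is the more robust fix):

```python
import pandas as pd

df = pd.DataFrame({'plan': ['basic'] * 4, 'churned': [1, 0, 0, 0]})

# Naive: every row's encoding includes its own target value
naive = df['plan'].map(df.groupby('plan')['churned'].mean())

# Leave-one-out: subtract the row's own target before averaging
grp = df.groupby('plan')['churned']
loo = (grp.transform('sum') - df['churned']) / (grp.transform('count') - 1)

print(naive.tolist())          # [0.25, 0.25, 0.25, 0.25]
print(loo.round(3).tolist())   # [0.0, 0.333, 0.333, 0.333]
```

The churned row sees a lower encoding than its non-churned peers only after its own label is excluded; the naive version hands every row a piece of its own answer.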

The correct approach uses cross-validation within the training set:

from sklearn.model_selection import KFold, train_test_split

def target_encode_cv(train_df, test_df, cat_col, target_col, n_splits=5,
                     smoothing=10, random_state=42):
    """
    Target encoding with cross-validation to prevent leakage.

    For each fold of the training data:
    - Compute target mean on the out-of-fold data
    - Apply to the in-fold data

    For test data:
    - Use the global target mean from the full training set

    Smoothing blends the category mean with the global mean,
    regularizing categories with few observations.
    """
    global_mean = train_df[target_col].mean()

    # Initialize encoded columns
    train_encoded = pd.Series(index=train_df.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    for train_idx, val_idx in kf.split(train_df):
        # Compute target stats on training fold
        fold_train = train_df.iloc[train_idx]
        stats = fold_train.groupby(cat_col)[target_col].agg(['mean', 'count'])

        # Apply smoothing: blend category mean with global mean
        # More observations -> trust the category mean more
        smoothed_mean = (
            (stats['count'] * stats['mean'] + smoothing * global_mean) /
            (stats['count'] + smoothing)
        )

        # Apply to validation fold
        val_fold = train_df.iloc[val_idx]
        train_encoded.iloc[val_idx] = val_fold[cat_col].map(smoothed_mean)

    # Fill any unmapped categories with global mean
    train_encoded = train_encoded.fillna(global_mean)

    # For test data: use full training set statistics
    stats = train_df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    smoothed_mean = (
        (stats['count'] * stats['mean'] + smoothing * global_mean) /
        (stats['count'] + smoothing)
    )
    test_encoded = test_df[cat_col].map(smoothed_mean).fillna(global_mean)

    return train_encoded, test_encoded

# Usage
X_train_split, X_test_split, y_train_split, y_test_split = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

train_enc, test_enc = target_encode_cv(
    X_train_split, X_test_split,
    cat_col='primary_genre',
    target_col='churned_within_30_days',
    n_splits=5,
    smoothing=10,
    random_state=42
)

print("Target-encoded genre (training set sample):")
print(train_enc.head(10).round(4))
Target-encoded genre (training set sample):
0     0.0724
1     0.0891
2     0.0634
3     0.1042
4     0.0724
5     0.0634
6     0.0891
7     0.0724
8     0.1042
9     0.0788
dtype: float64

The smoothing parameter is crucial. Without smoothing, a genre with only 3 subscribers and a 33% churn rate gets an encoded value of 0.33 --- which is noise, not signal. Smoothing pulls this toward the global mean (0.082), giving (3 * 0.33 + 10 * 0.082) / (3 + 10) = 0.139. With more data, the category-specific mean dominates. With little data, the global mean dominates. This is Bayesian shrinkage in disguise.
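The arithmetic above as a small helper, showing both regimes:

```python
def smoothed_mean(cat_mean, cat_count, global_mean, smoothing=10):
    """Shrink a category's target mean toward the global mean.

    With few observations the global mean dominates; with many,
    the category's own mean wins.
    """
    return (cat_count * cat_mean + smoothing * global_mean) / (cat_count + smoothing)

print(round(smoothed_mean(0.33, 3, 0.082), 3))     # 0.139 -- tiny category, heavy shrinkage
print(round(smoothed_mean(0.33, 3000, 0.082), 3))  # 0.329 -- large category, barely moved
```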

Common Mistake --- Using target encoding without smoothing on categories with few observations. A category with 2 observations and 100% churn rate is not truly 100% churn --- it is 2 random observations. Smoothing prevents your model from overfitting to small-sample noise.


Putting It All Together: The StreamFlow Feature Matrix

Let us assemble all of our engineered features into a single feature matrix for the progressive project:

def engineer_streamflow_features(df, events_df, prediction_date):
    """
    Master feature engineering function for the StreamFlow churn model.

    Takes raw subscriber data and events, returns a feature matrix.
    This is Milestone M2 of the progressive project.
    """
    features = pd.DataFrame(index=df.index)

    # === Tenure features ===
    features['tenure_months'] = (
        (prediction_date - df['signup_date']).dt.days / 30.44
    ).round(1)
    features['is_first_90_days'] = (features['tenure_months'] < 3).astype(int)

    # === Recency features ===
    features['days_since_last_login'] = (
        prediction_date - df['last_login_date']
    ).dt.days
    features['active_last_7d'] = (features['days_since_last_login'] <= 7).astype(int)

    # === Frequency features (multiple windows) ===
    for window_days, label in [(7, '7d'), (30, '30d'), (90, '90d')]:
        cutoff = prediction_date - pd.Timedelta(days=window_days)
        window_events = events_df[
            (events_df['event_date'] >= cutoff) &
            (events_df['event_date'] < prediction_date)
        ]

        usage = window_events.groupby('subscriber_id').agg(
            sessions=('event_date', 'nunique'),
            total_hours=('hours_watched', 'sum'),
            unique_genres=('genre', 'nunique')
        )

        features[f'sessions_last_{label}'] = usage['sessions'].reindex(
            df.index, fill_value=0
        )
        features[f'hours_last_{label}'] = usage['total_hours'].reindex(
            df.index, fill_value=0
        )
        features[f'genres_last_{label}'] = usage['unique_genres'].reindex(
            df.index, fill_value=0
        )

    # === Support tickets ===
    cutoff_90d = prediction_date - pd.Timedelta(days=90)
    ticket_events = events_df[
        (events_df['event_type'] == 'support_ticket') &
        (events_df['event_date'] >= cutoff_90d) &
        (events_df['event_date'] < prediction_date)
    ]
    features['support_tickets_last_90d'] = (
        ticket_events.groupby('subscriber_id').size()
        .reindex(df.index, fill_value=0)
    )

    # === Trend features ===
    recent_start = prediction_date - pd.Timedelta(days=30)
    prior_start = prediction_date - pd.Timedelta(days=60)

    recent = events_df[
        (events_df['event_date'] >= recent_start) &
        (events_df['event_date'] < prediction_date)
    ].groupby('subscriber_id')['hours_watched'].sum()

    prior = events_df[
        (events_df['event_date'] >= prior_start) &
        (events_df['event_date'] < recent_start)
    ].groupby('subscriber_id')['hours_watched'].sum()

    features['hours_last_30d_trend'] = (
        recent.reindex(df.index, fill_value=0) -
        prior.reindex(df.index, fill_value=0)
    )
    features['usage_declining'] = (features['hours_last_30d_trend'] < -2).astype(int)

    # === Ratio features ===
    # Exposure window is 90 days, so cap tenure at 3 months in the denominator
    features['hours_per_tenure_month'] = np.where(
        features['tenure_months'] > 0,
        features['hours_last_90d'] / features['tenure_months'].clip(upper=3),
        0
    )
    features['tickets_per_hour'] = np.where(
        features['hours_last_90d'] > 0,
        features['support_tickets_last_90d'] / features['hours_last_90d'],
        features['support_tickets_last_90d']
    )

    # === Interaction features ===
    features['device_x_genre_diversity'] = (
        df['num_devices'] * features['genres_last_30d']
    )
    features['new_and_inactive'] = (
        (features['tenure_months'] < 3) & (features['hours_last_30d'] == 0)
    ).astype(int)

    # === Plan change features ===
    features['plan_changes'] = df['plan_change_count']
    features['has_downgraded'] = df['has_downgraded'].astype(int)

    # === Device features ===
    features['device_count'] = df['num_devices']
    features['is_multi_device'] = (df['num_devices'] > 1).astype(int)

    # === Genre diversity ===
    features['genre_diversity_score'] = np.where(
        features['sessions_last_90d'] > 0,
        features['genres_last_90d'] / features['sessions_last_90d'],
        0
    )

    # === Log transforms (for linear models) ===
    features['log_hours_last_30d'] = np.log1p(features['hours_last_30d'])
    features['log_tenure_months'] = np.log1p(features['tenure_months'])

    print(f"Engineered {features.shape[1]} features for {features.shape[0]} subscribers")

    return features

# Build the feature matrix
feature_matrix = engineer_streamflow_features(df, events_df, prediction_date)
Engineered 27 features for 2400000 subscribers
# Quick validation: check feature distributions
print(feature_matrix.describe().round(3).T[['mean', 'std', 'min', 'max']])
                            mean       std     min       max
tenure_months             14.387    10.260    0.000    71.900
is_first_90_days           0.121     0.326    0.000     1.000
days_since_last_login     12.400    18.700    0.000   365.000
active_last_7d             0.687     0.464    0.000     1.000
sessions_last_7d           3.200     2.400    0.000     7.000
hours_last_7d              4.340     3.820    0.000    28.600
genres_last_7d             1.800     1.300    0.000     6.000
sessions_last_30d         11.400     7.800    0.000    30.000
hours_last_30d            18.700    14.200    0.000   120.400
genres_last_30d            3.100     2.000    0.000     8.000
sessions_last_90d         28.600    19.400    0.000    90.000
hours_last_90d            52.800    41.300    0.000   362.000
genres_last_90d            4.200     2.400    0.000     8.000
support_tickets_last_90d   0.800     1.400    0.000    23.000
hours_last_30d_trend      -1.240     9.870  -78.400    92.600
usage_declining            0.310     0.463    0.000     1.000
hours_per_tenure_month     5.872     8.124    0.000   120.400
tickets_per_hour           0.062     0.187    0.000    12.000
device_x_genre_diversity   5.834     5.192    0.000    48.000
new_and_inactive           0.024     0.153    0.000     1.000
plan_changes               0.680     1.020    0.000     8.000
has_downgraded             0.124     0.330    0.000     1.000
device_count               1.840     0.920    1.000     6.000
is_multi_device            0.621     0.485    0.000     1.000
genre_diversity_score      0.134     0.087    0.000     1.000
log_hours_last_30d         2.528     1.142    0.000     4.797
log_tenure_months          2.364     0.844    0.000     4.290

That is 27 features engineered from raw subscription data, event logs, and domain knowledge. In the exercises, you will cover at least 15 distinct feature concepts (several of these 27 are variants of the same concept).


Feature Engineering Without Data Leakage

This section is the most important section in the chapter. Everything we have built above is useless if we violate the train/test boundary.

The Golden Rule

Every feature must be computable using only information available at the time of prediction.

This rule has two implications:

  1. No future information. If you are predicting churn as of January 31, 2025, you cannot use any data from February 2025. This is obvious, but it is easy to violate when your event table includes all dates and you forget to filter.

  2. No target information. Feature statistics computed from the training set must not include information from the test set. If you compute the mean of a feature for normalization, you compute it on the training set only. If you use target encoding, you compute the target means on the training set only.
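The first implication can be enforced mechanically. A small guard like this (the names are illustrative, not from the StreamFlow codebase) catches the forgotten-filter mistake before it poisons a feature matrix:

```python
import pandas as pd

def assert_no_future_events(events_df, prediction_date, date_col='event_date'):
    """Raise if any event used for features occurs at or after prediction time."""
    n_future = int((events_df[date_col] >= prediction_date).sum())
    if n_future > 0:
        raise ValueError(
            f"{n_future} events at or after {prediction_date.date()} -- "
            "filter the event table before building features"
        )

events = pd.DataFrame({'event_date': pd.to_datetime(['2025-01-10', '2025-02-03'])})
try:
    assert_no_future_events(events, pd.Timestamp('2025-01-31'))
except ValueError as err:
    print(err)  # the February event trips the guard
```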

The Leakage Checklist

Before training any model, walk through this checklist for every feature:

def leakage_audit(feature_name, description):
    """
    Document each feature's relationship to the prediction timeline.

    For each feature, answer:
    1. When is this information generated? (Before/at/after prediction time?)
    2. Does this feature use target information? (Direct or indirect?)
    3. Could this feature be computed in real-time at prediction time?
    """
    print(f"Feature: {feature_name}")
    print(f"Description: {description}")
    print("---")

# Examples
leakage_audit(
    'days_since_last_login',
    'Available: Yes. Computed from last_login_date, which is known at prediction time.'
)
leakage_audit(
    'hours_last_30d',
    'Available: Yes. Computed from events in the 30 days before prediction date.'
)
leakage_audit(
    'cancellation_reason_length',
    'Available: NO. The cancellation reason is written WHEN the user cancels. '
    'Using it to predict cancellation is using the target to predict the target.'
)
leakage_audit(
    'avg_hours_watched_all_users',
    'Potential leakage: If computed on the full dataset including test set, '
    'this leaks test-set information. Compute only on training data.'
)

Proper Feature Engineering Pipeline

The correct approach is to build your feature engineering inside a pipeline that respects the train/test split:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin

class TemporalFeatureEngineer(BaseEstimator, TransformerMixin):
    """
    Custom transformer that creates temporal features.

    Wrapping feature engineering in a sklearn transformer ensures
    it runs inside cross-validation and respects the train/test boundary.
    """
    def __init__(self, prediction_date=None):
        self.prediction_date = prediction_date

    def fit(self, X, y=None):
        # No fitting needed for deterministic temporal features
        return self

    def transform(self, X):
        X = X.copy()
        pred_date = self.prediction_date or pd.Timestamp.now()

        if 'signup_date' in X.columns:
            X['tenure_months'] = (
                (pred_date - X['signup_date']).dt.days / 30.44
            )
        if 'last_login_date' in X.columns:
            X['days_since_last_login'] = (
                pred_date - X['last_login_date']
            ).dt.days

        return X


class StatisticalFeatureEngineer(BaseEstimator, TransformerMixin):
    """
    Custom transformer for features that require fitting on training data.

    Example: target encoding, mean/std normalization of group features.
    """
    def __init__(self, smoothing=10):
        self.smoothing = smoothing

    def fit(self, X, y):
        # Fitted attributes (trailing underscore) are set in fit(), per sklearn convention
        self.encoding_maps_ = {}
        self.global_mean_ = y.mean()

        # Fit target encoding for categorical columns
        for col in X.select_dtypes(include='object').columns:
            stats = pd.DataFrame({'target': y, 'cat': X[col]}).groupby('cat')['target'].agg(['mean', 'count'])
            self.encoding_maps_[col] = (
                (stats['count'] * stats['mean'] + self.smoothing * self.global_mean_) /
                (stats['count'] + self.smoothing)
            )

        return self

    def transform(self, X):
        X = X.copy()
        for col, mapping in self.encoding_maps_.items():
            X[f'{col}_target_enc'] = X[col].map(mapping).fillna(self.global_mean_)
        return X

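To see why the wrapping matters, here is a self-contained toy (a sketch with a deliberately simple transformer, not the StreamFlow classes above): inside cross_val_score, fit() sees only the training folds, so the learned statistics never include validation rows.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

class TrainMeanCenterer(BaseEstimator, TransformerMixin):
    """Toy transformer: centers columns using means learned in fit() only."""
    def fit(self, X, y=None):
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.means_

rng = np.random.default_rng(42)
X = pd.DataFrame({
    'hours': rng.exponential(10, 500),
    'tenure': rng.integers(1, 48, 500).astype(float),
})
y = (X['hours'] + rng.normal(0, 5, 500) < 5).astype(int)  # noisy toy target

pipe = Pipeline([('center', TrainMeanCenterer()), ('clf', LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(scores.round(3))  # per-fold AUCs; each fold's means were fit on train rows only
```

Calling the transformer manually on the full dataset before splitting would quietly fold test-set statistics into every feature; the Pipeline makes that mistake impossible.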
Production Tip --- In production ML systems, feature engineering functions are versioned alongside model code. When you retrain a model, you must use the same feature engineering code that was used during the previous training. A mismatch between training-time and serving-time feature computation is the #1 cause of silent model degradation. Feature stores (Chapter 10) exist specifically to solve this problem.


Feature Validation: Sanity Checks Before Modeling

Before you hand your feature matrix to any model, run these sanity checks:

def validate_feature_matrix(features, target):
    """
    Run sanity checks on the feature matrix before modeling.

    Catches common problems: leakage, constant features,
    extreme cardinality, and suspiciously high correlations.
    """
    print(f"Shape: {features.shape}")
    print(f"Target rate: {target.mean():.4f}")
    print()

    # Check for constant or near-constant features
    nunique = features.nunique()
    constant = nunique[nunique <= 1]
    if len(constant) > 0:
        print(f"WARNING: Constant features (remove these): {list(constant.index)}")

    # Check for features with very high missing rates
    missing_rate = features.isnull().mean()
    high_missing = missing_rate[missing_rate > 0.5]
    if len(high_missing) > 0:
        print(f"WARNING: >50% missing: {dict(high_missing.round(3))}")

    # Check for suspiciously high AUC (possible leakage)
    from sklearn.metrics import roc_auc_score
    print("\nPer-feature AUC (check for leakage if any > 0.90):")
    for col in features.select_dtypes(include=[np.number]).columns:
        valid_mask = features[col].notna()
        if valid_mask.sum() > 100:
            auc = roc_auc_score(
                target[valid_mask],
                features[col][valid_mask]
            )
            # Flag features with suspiciously high predictive power
            flag = " <<< INVESTIGATE" if auc > 0.90 or auc < 0.10 else ""
            if auc > 0.70 or auc < 0.30 or flag:
                print(f"  {col:35s}  AUC: {auc:.4f}{flag}")

    # Check for high inter-feature correlation
    numeric_features = features.select_dtypes(include=[np.number])
    corr_matrix = numeric_features.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    high_corr = [(col, row, corr_matrix.loc[row, col])
                 for col in upper.columns for row in upper.index
                 if upper.loc[row, col] > 0.95]

    if high_corr:
        print(f"\nWARNING: Highly correlated pairs (>0.95):")
        for col1, col2, corr in high_corr:
            print(f"  {col1} <-> {col2}: {corr:.3f}")

validate_feature_matrix(feature_matrix, y)
Shape: (2400000, 27)
Target rate: 0.0820

Per-feature AUC (check for leakage if any > 0.90):
  days_since_last_login              AUC: 0.7842
  active_last_7d                     AUC: 0.7318
  hours_last_30d                     AUC: 0.7124
  sessions_last_30d                  AUC: 0.7056
  usage_declining                    AUC: 0.7012
  hours_last_30d_trend               AUC: 0.2958

WARNING: Highly correlated pairs (>0.95):
  hours_last_30d <-> log_hours_last_30d: 0.967
  sessions_last_30d <-> sessions_last_90d: 0.952

The output is clean. No single feature has an AUC above 0.90, which would suggest leakage. The top feature (days_since_last_login, AUC 0.784) is strong but not suspiciously so --- it is simply a great feature. The correlated pairs make sense (the log transform is monotonically related to the raw feature; 30-day sessions is a subset of 90-day sessions). You may want to drop one from each pair to reduce redundancy, but tree-based models generally handle this gracefully.
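If you do decide to prune, a small helper (a sketch; the 0.95 threshold is a judgment call) drops one member of each offending pair:

```python
import numpy as np
import pandas as pd

def drop_correlated(features, threshold=0.95):
    """Drop one feature from each pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Upper triangle only, so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [1, 0, 2, 1]})
kept, dropped = drop_correlated(toy)
print(dropped)  # ['b'] -- perfectly correlated with 'a'
```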


The Feature Engineering Mindset

Let us step back from the code and talk about the thinking behind it.

The features we built in this chapter were not discovered through random exploration. They were derived from a systematic process:

  1. We asked what a human expert would look at. A customer success manager would check: When did this person last log in? Are they using the product less? Have they contacted support? Are they new or established? Do they use the product on multiple devices? These questions became features.

  2. We encoded time at multiple scales. Seven-day, 30-day, and 90-day windows capture different aspects of behavior. A 7-day window is sensitive but noisy. A 90-day window is stable but slow to react. Giving the model multiple scales lets it learn which time horizon matters.

  3. We computed ratios, not just counts. Tickets per hour is more informative than raw ticket count. Hours per tenure month is more informative than total hours. Ratios normalize for exposure and make comparisons fair.

  4. We captured trends, not just snapshots. The direction of change is often more predictive than the absolute level. A subscriber declining from 30 hours/month to 10 hours/month is in trouble, even though 10 hours/month is still above average.

  5. We created interaction features that encode domain logic. "New and inactive" is not just tenure < 3 and hours = 0. It is the specific combination that customer success teams know is an emergency signal.

This is the craft of feature engineering. It is not automatable --- not yet, anyway. Feature stores can automate the computation. AutoML can automate the modeling. But the decision about which features to create still requires a human who understands both the data and the domain.

War Story --- A payments company spent six months building a fraud detection model with 200 features, mostly engineered by data scientists who had never seen a fraudulent transaction. The model achieved an AUC of 0.82. Then they invited two fraud analysts to a feature brainstorming session. In 90 minutes, the analysts suggested eight features: "transaction amount as a multiple of the user's average", "time since last successful transaction", "number of failed transactions in the last hour", "device fingerprint match rate", and four others. Those eight features, added to a simple gradient boosted tree, pushed the AUC to 0.94. The analysts did not know what an AUC was. They knew what fraud looked like.


Progressive Project Milestone M2: Engineer StreamFlow Features

Your task for this milestone:

  1. Start from the data you extracted in the Chapter 5 SQL milestone (M1).

  2. Engineer at least 15 features across these categories:
     - Tenure (at least 2 features): account age, tenure bucket, first-90-day flag
     - Recency (at least 2 features): days since last login, days since last action
     - Frequency (at least 3 features): sessions in multiple windows, hours in multiple windows
     - Trends (at least 2 features): usage change, declining/increasing flags
     - Ratios (at least 2 features): tickets per hour, hours per tenure month
     - Interactions (at least 2 features): domain-motivated combinations
     - Support (at least 1 feature): ticket count in a recent window
     - Engagement diversity (at least 1 feature): genre diversity score

  3. Run the validation checks: verify no feature has an AUC > 0.95 (leakage check), no constant features, and no unexpected missing values.

  4. Compare model performance: train a simple logistic regression and a gradient boosted tree on raw features vs. engineered features. Document the AUC improvement.

  5. Write a brief (1-page) feature dictionary describing each feature, its business logic, and how it is computed.
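The validation checks in step 3 can be sketched as a single helper. This is a hypothetical function, not part of the StreamFlow codebase; the thresholds match the milestone (per-feature AUC above 0.95 flags likely leakage), and the toy data deliberately includes a leaky feature and a constant feature so the report has something to catch.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def validate_features(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Per-feature sanity checks: leakage (AUC), constants, missing values."""
    rows = []
    for col in X.columns:
        x = X[col]
        n_missing = int(x.isna().sum())
        is_constant = x.nunique(dropna=True) <= 1
        auc = float("nan")
        suspect = False
        if not is_constant:
            mask = x.notna()
            auc = roc_auc_score(y[mask], x[mask])
            auc = max(auc, 1 - auc)  # direction-agnostic
            suspect = auc > 0.95
        rows.append({"feature": col, "auc": auc, "suspect_leakage": suspect,
                     "constant": is_constant, "n_missing": n_missing})
    return pd.DataFrame(rows).sort_values("auc", ascending=False)

# Toy demonstration: a near-copy of the label gets flagged as leakage.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, 500))
X = pd.DataFrame({
    "days_since_last_login": rng.exponential(10, 500) + y * 5,  # real signal
    "leaky_copy_of_label": y + rng.normal(0, 0.01, 500),        # leakage
    "all_zeros": np.zeros(500),                                  # constant
})
report = validate_features(X, y)
```

Running the checks per feature like this catches the most common failure mode (one column accidentally derived from the label) before any model is trained.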
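The raw-vs-engineered comparison in step 4 can also be sketched on synthetic data. This is not StreamFlow data: the toy target is built so that churn depends on a usage ratio the raw columns encode only implicitly, which is exactly the situation where an engineered ratio feature pays off.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: churn probability is driven by hours per tenure month.
rng = np.random.default_rng(42)
n = 1000
raw = pd.DataFrame({
    "total_hours": rng.exponential(50, n),
    "tenure_months": rng.integers(1, 36, n),
    "tickets": rng.poisson(1.0, n),
})
hours_per_month = raw["total_hours"] / raw["tenure_months"]
churn_prob = 1 / (1 + np.exp(hours_per_month - 2))  # low usage -> likely churn
y = rng.binomial(1, churn_prob)

engineered = raw.assign(
    hours_per_tenure_month=hours_per_month,
    tickets_per_hour=raw["tickets"] / (raw["total_hours"] + 1),
)

results = {}
for feat_name, X in [("raw", raw), ("engineered", engineered)]:
    models = {
        "logistic": make_pipeline(StandardScaler(),
                                  LogisticRegression(max_iter=1000)),
        "gbt": GradientBoostingClassifier(random_state=0),
    }
    for model_name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        results[(feat_name, model_name)] = auc
        print(f"{feat_name:10s} {model_name:8s} AUC={auc:.3f}")
```

On data like this, the logistic regression with the engineered ratio typically outperforms the same model on raw columns, while the tree narrows the gap by approximating the interaction from splits; documenting that pattern is the point of the milestone comparison.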

Try It --- The best features are the ones not listed above. Think about what StreamFlow-specific signals we have not captured. Does the user share their account? (Multiple simultaneous sessions.) Did they recently change their password? (Security concern or account hand-off.) Do they use the mobile app or only the web? (Mobile users may be more engaged.) Each insight is a candidate feature. The ones you think of yourself will teach you more than the ones in this chapter.


Summary

Feature engineering is where data science becomes a craft. It is the process of translating domain knowledge, business logic, and human intuition into computable variables that give your model the information it needs to make accurate predictions.

The key techniques --- temporal features (recency, frequency, tenure), mathematical transformations (log, Box-Cox, polynomial), interaction features, ratio features, and target encoding --- are tools. The real skill is knowing which tool to use and why.

We demonstrated that engineered features lift model performance far more than algorithm choice. A logistic regression with good features beat a gradient boosted tree with raw features. This is the lesson that separates practicing data scientists from tutorial followers: invest your time in features first, algorithms second.

We also established the critical discipline of preventing data leakage in feature engineering. Every feature must be computable at prediction time. Every statistic must be computed only on training data. Feature engineering inside a cross-validation pipeline is not optional --- it is the difference between a model that works in a notebook and a model that works in production.

In the next chapter, we will tackle categorical data --- one of the most common and most mishandled feature types in applied machine learning. The encoding decisions you make for categorical features will interact with everything you learned here.


Next: Chapter 7 --- Handling Categorical Data
Previous: Chapter 5 --- SQL for Data Scientists