Exercises: Chapter 6

Feature Engineering


Exercise 1: The Human Expert Test (Conceptual)

You are building a model to predict 30-day hospital readmission for patients discharged from Metro General Hospital. Before writing any code, list 10 features a senior nurse or physician would look at when discharging a patient and assessing their readmission risk. For each feature:

a) Name the feature and describe it in plain English.

b) Classify it as recency, frequency, severity, demographic, or behavioral.

c) Describe how you would compute it from a hospital's electronic health record (EHR) database.

d) Identify one potential data quality issue for that feature.


Exercise 2: Temporal Feature Windows (Applied)

Given the following StreamFlow subscriber event data:

import pandas as pd

events = pd.DataFrame({
    'subscriber_id': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'event_date': pd.to_datetime([
        '2025-01-28', '2025-01-20', '2025-01-05', '2024-12-15', '2024-11-10',
        '2025-01-30', '2025-01-29', '2025-01-28'
    ]),
    'hours_watched': [1.5, 2.0, 3.5, 4.0, 5.0, 0.5, 0.3, 0.8],
    'event_type': ['watch', 'watch', 'watch', 'watch', 'watch',
                   'watch', 'watch', 'watch']
})

prediction_date = pd.Timestamp('2025-01-31')

By hand (or with code), compute the following for each subscriber:

a) sessions_last_7d, sessions_last_30d, sessions_last_90d

b) hours_last_7d, hours_last_30d, hours_last_90d

c) hours_change_30d (hours in last 30 days minus hours in the 30 days before that)

d) Which subscriber shows a declining usage trend? Which shows a stable or increasing trend?
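If you want to check your hand computation, one possible sketch is below (the events frame is re-created so the snippet is self-contained; event_type is omitted since every row is 'watch', and the helper name window_features is illustrative):

```python
import pandas as pd

events = pd.DataFrame({
    'subscriber_id': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'event_date': pd.to_datetime([
        '2025-01-28', '2025-01-20', '2025-01-05', '2024-12-15', '2024-11-10',
        '2025-01-30', '2025-01-29', '2025-01-28'
    ]),
    'hours_watched': [1.5, 2.0, 3.5, 4.0, 5.0, 0.5, 0.3, 0.8],
})
prediction_date = pd.Timestamp('2025-01-31')

def window_features(df, pred_date, days):
    # Keep only events inside the lookback window ending at pred_date.
    cutoff = pred_date - pd.Timedelta(days=days)
    in_window = df[(df['event_date'] > cutoff) & (df['event_date'] <= pred_date)]
    agg = in_window.groupby('subscriber_id').agg(
        sessions=('event_date', 'count'),
        hours=('hours_watched', 'sum'),
    )
    return agg.rename(columns={'sessions': f'sessions_last_{days}d',
                               'hours': f'hours_last_{days}d'})

# Subscribers with no events in a window come out as NaN after the concat;
# in a real pipeline you would fillna(0).
feats = pd.concat(
    [window_features(events, prediction_date, d) for d in (7, 30, 90)], axis=1)
```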


Exercise 3: Ratio vs. Raw Features (Applied)

Two StreamFlow subscribers have the following profiles:

Feature                 Subscriber X    Subscriber Y
Tenure (months)         24              2
Total support tickets   6               6
Total hours watched     480             40
Hours last 30 days      20              20

a) Compute tickets_per_tenure_month for each subscriber. Who has the higher support burden relative to their tenure?

b) Compute hours_per_tenure_month for each subscriber. Who is the more intense user relative to their tenure?

c) A model using only raw features (total tickets = 6, hours last 30 days = 20) would treat these two subscribers identically on those dimensions. Explain why the ratio features give the model a more accurate picture.

d) Write a Python function that takes a DataFrame with tenure_months, support_ticket_count, and hours_last_30d columns and returns a DataFrame with the ratio features added.
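One possible shape for part (d) — a sketch, not the only correct answer; the clip-to-1 guard is an assumption about how you want to handle zero-tenure subscribers:

```python
import pandas as pd

def add_ratio_features(df):
    """Add per-tenure-month ratio features to a subscriber frame."""
    out = df.copy()
    # Guard against zero tenure (brand-new subscribers) to avoid division by zero.
    tenure = out['tenure_months'].clip(lower=1)
    out['tickets_per_tenure_month'] = out['support_ticket_count'] / tenure
    out['hours_per_tenure_month'] = out['hours_last_30d'] / tenure
    return out

# The two profiles from the table above.
subs = pd.DataFrame({
    'tenure_months': [24, 2],
    'support_ticket_count': [6, 6],
    'hours_last_30d': [20, 20],
}, index=['X', 'Y'])
ratios = add_ratio_features(subs)
```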


Exercise 4: Leakage Detection (Conceptual)

For each of the following features in a 30-day churn prediction model, classify it as safe, potential leakage, or definite leakage. Explain your reasoning.

a) days_since_last_login computed as of the prediction date

b) cancellation_reason_text from the user's cancellation survey

c) avg_hours_last_30d computed on the 30 days before the prediction date

d) plan_type_at_prediction --- the user's plan at the time of prediction

e) avg_churn_rate_by_genre computed on the entire dataset (train + test)

f) num_logins_next_7d --- login count in the 7 days after the prediction date

g) support_ticket_sentiment_score from tickets filed before the prediction date

h) monthly_revenue_change computed using the billing record from the month the churn decision was made


Exercise 5: Log Transformation (Applied)

The total_hours_watched feature for StreamFlow subscribers has the following distribution statistics:

mean:     312.4
median:   186.2
std:      428.7
skewness:   3.2
min:        0.0
max:    12847.0

a) Is this feature right-skewed, left-skewed, or approximately symmetric? How can you tell from the summary statistics alone?

b) Apply np.log1p() to the following values and show the results: 0, 1, 10, 100, 1000, 10000.

c) Explain why np.log1p() is preferred over np.log() when the feature contains zeros.

d) Would you apply a log transformation to this feature if your model is a random forest? Why or why not?

e) Would you apply a log transformation if your model is logistic regression? Why or why not?
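For part (b), a two-line check of the transformation (np.log1p(x) computes log(1 + x), so it is defined at zero where np.log(x) is not):

```python
import numpy as np

values = np.array([0, 1, 10, 100, 1000, 10000], dtype=float)
# log1p(0) is exactly 0; log(0) would be -inf.
logged = np.log1p(values)
```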


Exercise 6: Interaction Features with Business Logic (Applied)

Create interaction features for the following scenario:

StreamFlow has identified three behavioral segments:

  • Power users: hours_last_30d > 40 and genres_last_30d > 4
  • Casual users: hours_last_30d between 5 and 40
  • Dormant users: hours_last_30d < 5

Write Python code that:

a) Creates a user_segment categorical feature based on these rules.

b) Creates a dormant_with_tickets interaction feature (binary: 1 if dormant AND has support tickets in last 90 days).

c) Creates a power_user_declining interaction feature (binary: 1 if power user AND hours_change_30d < -10).

d) Explain in 2-3 sentences why dormant_with_tickets might be a stronger churn signal than either is_dormant or has_tickets alone.
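One possible sketch of parts (a)-(c). The column names tickets_last_90d and genres_last_30d are assumptions (the exercise does not fix them), and note that a user with hours_last_30d > 40 but genres_last_30d <= 4 falls through to 'casual' here — decide yourself whether that matches the intended rules:

```python
import numpy as np
import pandas as pd

def add_segment_features(df):
    out = df.copy()
    # np.select checks conditions in order; everything else defaults to casual.
    conditions = [
        (out['hours_last_30d'] > 40) & (out['genres_last_30d'] > 4),
        out['hours_last_30d'] < 5,
    ]
    out['user_segment'] = np.select(conditions, ['power', 'dormant'],
                                    default='casual')
    out['dormant_with_tickets'] = (
        (out['user_segment'] == 'dormant') & (out['tickets_last_90d'] > 0)
    ).astype(int)
    out['power_user_declining'] = (
        (out['user_segment'] == 'power') & (out['hours_change_30d'] < -10)
    ).astype(int)
    return out

# Three illustrative rows: a declining power user, a dormant user with
# tickets, and a casual user.
demo = pd.DataFrame({
    'hours_last_30d': [50.0, 2.0, 20.0],
    'genres_last_30d': [6, 1, 3],
    'tickets_last_90d': [0, 2, 0],
    'hours_change_30d': [-15.0, -1.0, 3.0],
})
flags = add_segment_features(demo)
```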


Exercise 7: Date Feature Engineering (Applied)

Given a DataFrame with a signup_date column, write a function that extracts the following features:

a) signup_day_of_week (0=Monday through 6=Sunday)

b) signup_is_weekend (binary)

c) signup_month_sin and signup_month_cos (cyclical encoding)

d) signup_is_january (binary --- new year's resolution sign-ups often have high churn)

e) days_since_signup as of a given prediction date

Test your function on these dates: 2024-01-02 (Tuesday), 2024-07-04 (Thursday), 2024-12-25 (Wednesday), 2024-11-29 (Friday, Black Friday).

For part (c), explain why cyclical encoding is preferable to using the raw month number (1-12) as a feature for linear models.
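One possible shape for the extraction function (a sketch; it assumes signup_date is already a datetime column):

```python
import numpy as np
import pandas as pd

def add_signup_date_features(df, prediction_date):
    out = df.copy()
    out['signup_day_of_week'] = out['signup_date'].dt.dayofweek  # 0=Mon ... 6=Sun
    out['signup_is_weekend'] = (out['signup_day_of_week'] >= 5).astype(int)
    month = out['signup_date'].dt.month
    # Cyclical encoding: December (12) and January (1) land close together
    # on the sin/cos circle, unlike the raw month number.
    out['signup_month_sin'] = np.sin(2 * np.pi * month / 12)
    out['signup_month_cos'] = np.cos(2 * np.pi * month / 12)
    out['signup_is_january'] = (month == 1).astype(int)
    out['days_since_signup'] = (prediction_date - out['signup_date']).dt.days
    return out

# The four test dates from the exercise.
dates = pd.DataFrame({'signup_date': pd.to_datetime(
    ['2024-01-02', '2024-07-04', '2024-12-25', '2024-11-29'])})
date_feats = add_signup_date_features(dates, pd.Timestamp('2025-01-31'))
```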


Exercise 8: Target Encoding from Scratch (Applied)

Given the following training data:

subscriber_id   plan_type    churned
1               basic        1
2               basic        0
3               basic        1
4               basic        0
5               premium      0
6               premium      0
7               premium      1
8               enterprise   0
9               enterprise   0

a) Compute the naive target encoding (no smoothing) for each plan type.

b) Compute the smoothed target encoding with smoothing = 5. The formula is: (count * category_mean + smoothing * global_mean) / (count + smoothing). Show your work.

c) What is the global churn rate? Which plan type's encoding changes the most with smoothing, and why?

d) Explain why the enterprise encoding is pulled most strongly toward the global mean.

e) Now suppose a new plan type "student" appears in the test data that was never seen in training. What value should you assign? Why?
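To check your hand computation, here is one possible implementation of the smoothing formula from part (b) (the helper name is illustrative):

```python
import pandas as pd

train = pd.DataFrame({
    'plan_type': ['basic'] * 4 + ['premium'] * 3 + ['enterprise'] * 2,
    'churned':   [1, 0, 1, 0,    0, 0, 1,          0, 0],
})

def smoothed_target_encoding(df, col, target, smoothing=5):
    # Blend each category's mean with the global mean, weighted by count:
    # (count * category_mean + smoothing * global_mean) / (count + smoothing)
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(['mean', 'count'])
    return ((stats['count'] * stats['mean'] + smoothing * global_mean)
            / (stats['count'] + smoothing))

naive = train.groupby('plan_type')['churned'].mean()
smoothed = smoothed_target_encoding(train, 'plan_type', 'churned', smoothing=5)
# Part (e): a category never seen in training has no encoding of its own;
# falling back to the global training mean is the standard choice.
unseen_value = train['churned'].mean()
```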


Exercise 9: Feature Engineering for TurbineTech (Applied)

TurbineTech has sensor data from 1,200 wind turbines with 847 sensors each. For this exercise, focus on two sensors: vibration (Hz) and temperature (Celsius).

Given 24 hours of readings at 10-minute intervals (144 readings per sensor per turbine):

import numpy as np

np.random.seed(42)
n_readings = 144

# Normal turbine
normal_vibration = np.random.normal(loc=45.0, scale=2.0, size=n_readings)
normal_temperature = np.random.normal(loc=62.0, scale=1.5, size=n_readings)

# Failing turbine (vibration increasing, temperature spiking)
failing_vibration = np.linspace(44, 58, n_readings) + np.random.normal(0, 1.5, n_readings)
failing_temperature = np.concatenate([
    np.random.normal(62, 1.5, 120),
    np.random.normal(72, 3.0, 24)  # spike in last 4 hours
])

Write code to compute the following features for each turbine:

a) rolling_mean_vibration_1h (mean of last 6 readings)

b) rolling_std_vibration_1h (standard deviation of last 6 readings)

c) rate_of_change_temperature (slope of temperature over last 2 hours, i.e., last 12 readings)

d) vibration_temperature_correlation (Pearson correlation over the full 24 hours)

e) max_vibration_deviation (maximum absolute deviation from the 24-hour mean)

Compare the feature values for the normal vs. failing turbine. Which features best distinguish them?
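One possible sketch of the five features, reusing the simulated readings above (the window sizes assume 10-minute readings, i.e. 6 per hour; the helper name is illustrative):

```python
import numpy as np

np.random.seed(42)
n_readings = 144
normal_vibration = np.random.normal(loc=45.0, scale=2.0, size=n_readings)
normal_temperature = np.random.normal(loc=62.0, scale=1.5, size=n_readings)
failing_vibration = (np.linspace(44, 58, n_readings)
                     + np.random.normal(0, 1.5, n_readings))
failing_temperature = np.concatenate([
    np.random.normal(62, 1.5, 120),
    np.random.normal(72, 3.0, 24),  # spike in last 4 hours
])

def turbine_features(vibration, temperature):
    # Slope of a degree-1 fit over the last 12 readings (2 hours).
    slope, _ = np.polyfit(np.arange(12), temperature[-12:], 1)
    return {
        'rolling_mean_vibration_1h': vibration[-6:].mean(),
        'rolling_std_vibration_1h': vibration[-6:].std(),
        'rate_of_change_temperature': slope,
        'vibration_temperature_correlation': np.corrcoef(vibration,
                                                         temperature)[0, 1],
        'max_vibration_deviation': np.abs(vibration - vibration.mean()).max(),
    }

normal_feats = turbine_features(normal_vibration, normal_temperature)
failing_feats = turbine_features(failing_vibration, failing_temperature)
```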


Exercise 10: Feature Importance Quick Check (Applied)

After engineering your features, you want a quick check of which ones are most predictive before training a full model. Write a function that:

a) Computes the AUC of each individual numeric feature against the binary target.

b) Flags any feature with AUC > 0.90 as a potential leakage risk.

c) Flags any feature with AUC between 0.48 and 0.52 as potentially uninformative.

d) Returns a sorted DataFrame of features and their individual AUCs.

Test it on a synthetic dataset where one feature is the target with noise added (simulating leakage) and another is pure random noise.
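One possible shape for the screen, including the synthetic test (a sketch; note that a feature negatively correlated with the target will show an AUC below 0.5 here, since this version does not flip direction):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def single_feature_auc(X, y):
    """Score each numeric feature individually against a binary target."""
    rows = []
    for col in X.select_dtypes(include=np.number).columns:
        auc = roc_auc_score(y, X[col])
        rows.append({
            'feature': col,
            'auc': auc,
            'leakage_risk': auc > 0.90,          # suspiciously predictive
            'uninformative': 0.48 <= auc <= 0.52,  # barely better than chance
        })
    return (pd.DataFrame(rows)
            .sort_values('auc', ascending=False)
            .reset_index(drop=True))

# Synthetic check: one feature is the target plus a little noise
# (simulated leakage), the other is pure noise.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, 500))
X = pd.DataFrame({
    'leaky': y + rng.normal(0, 0.1, 500),
    'noise': rng.normal(0, 1, 500),
})
report = single_feature_auc(X, y)
```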


Exercise 11: The Feature Dictionary (Applied)

Create a feature dictionary for the StreamFlow churn model. For each of the following 10 features, document:

  • Feature name (the column name in your DataFrame)
  • Business definition (what it means in plain English)
  • Computation logic (how it is computed, including any window periods)
  • Data source (which table(s) it comes from)
  • Expected range (min/max or typical values)
  • Known issues (missing data, edge cases, potential biases)

Features to document: tenure_months, days_since_last_login, hours_last_30d, support_tickets_last_90d, hours_change_30d, genre_diversity_score, tickets_per_hour, device_count, is_first_90_days, usage_declining.

Production Tip --- In production systems, the feature dictionary IS the documentation. When a new team member asks "what is genre_diversity_score?", the answer should be one click away. Invest in this document early --- it pays dividends every time someone questions why the model made a specific prediction.
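One possible schema for a dictionary entry, shown for a single feature (the field names and values here are illustrative, not a required format):

```python
# A feature dictionary can live as structured data (dict, YAML, or a table)
# so it is queryable as well as readable.
feature_dictionary = {
    'hours_last_30d': {
        'business_definition': 'Total hours of content watched in the 30 days '
                               'before the prediction date.',
        'computation': 'Sum of hours_watched over watch events with '
                       'event_date in (prediction_date - 30d, prediction_date].',
        'data_source': 'events table',
        'expected_range': '0 to roughly 300 hours',
        'known_issues': 'Missing for subscribers with no events in the window; '
                        'fill with 0, not NaN.',
    },
}
```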


Exercise 12: Preventing Train/Test Leakage (Applied)

You have the following feature engineering code:

from sklearn.model_selection import train_test_split

# Step 1: Compute global statistics
global_mean_hours = df['hours_last_30d'].mean()
global_std_hours = df['hours_last_30d'].std()

# Step 2: Normalize
df['hours_normalized'] = (df['hours_last_30d'] - global_mean_hours) / global_std_hours

# Step 3: Target encode plan_type
plan_means = df.groupby('plan_type')['churned'].mean()
df['plan_encoded'] = df['plan_type'].map(plan_means)

# Step 4: Split
X_train, X_test, y_train, y_test = train_test_split(
    df[['hours_normalized', 'plan_encoded']], df['churned'],
    test_size=0.2, random_state=42
)

a) Identify all sources of data leakage in this code.

b) Rewrite the code to eliminate all leakage. The split must happen before any statistics are computed.

c) Explain why this type of leakage is especially dangerous: the model will appear to work well in evaluation but fail in production.
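For reference, one leakage-free shape (a sketch, not the only correct answer; the demo frame at the bottom is illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_then_engineer(df, seed=42):
    """Split first; fit every statistic on the training fold only."""
    train, test = train_test_split(df, test_size=0.2, random_state=seed,
                                   stratify=df['churned'])
    train, test = train.copy(), test.copy()

    # Normalization statistics come from train only, then are applied to both.
    mu = train['hours_last_30d'].mean()
    sigma = train['hours_last_30d'].std()
    for part in (train, test):
        part['hours_normalized'] = (part['hours_last_30d'] - mu) / sigma

    # Target encoding is fit on train; plan types unseen in train fall
    # back to the train-set global mean.
    plan_means = train.groupby('plan_type')['churned'].mean()
    global_mean = train['churned'].mean()
    for part in (train, test):
        part['plan_encoded'] = part['plan_type'].map(plan_means).fillna(global_mean)
    return train, test

# Illustrative data with the three columns the original code touches.
df = pd.DataFrame({
    'hours_last_30d': list(range(20)),
    'plan_type': ['basic', 'premium'] * 10,
    'churned': [0, 1] * 10,
})
train, test = split_then_engineer(df)
```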


Exercise 13: Feature Engineering Sprint (Challenge)

This exercise simulates a real feature engineering sprint. You have 60 minutes (timed).

Starting from raw StreamFlow subscriber data with these columns:

  • subscriber_id, signup_date, plan_type, plan_price, last_login_date
  • total_hours_watched, num_devices, primary_genre
  • support_ticket_count, last_ticket_date
  • billing_failures, last_billing_failure_date
  • referral_source, country

Engineer as many features as you can. For each feature:

  1. Name it
  2. Write the one-line computation (pandas)
  3. Classify it (recency / frequency / tenure / ratio / interaction / transformation / categorical)

Target: 20 features in 60 minutes. Stretch goal: 30 features.

After the sprint, rank your features by predicted usefulness (your intuition, not a model). Compare your ranking to the per-feature AUC ranking from Exercise 10.


These exercises support Chapter 6: Feature Engineering. Return to the chapter for full context.