Chapter 2 Exercises

The Machine Learning Workflow

These exercises cover problem framing, data leakage detection, baseline establishment, train/test strategy, and workflow design. Work through them in order — each builds on skills from the previous.


Exercise 1: Problem Framing from a Vague Request

A hospital administrator says: "We need AI to reduce readmissions."

Write a problem framing document (following the template from Section 2.2) that includes:

  • A precise target variable definition (what exactly counts as a readmission? within what time window? does an ER visit count? what about planned readmissions for follow-up surgery?)
  • The observation unit (what does one row represent?)
  • The prediction timing (at what moment does the hospital need the prediction?)
  • What features would be available vs. excluded at the prediction time you chose
  • What action the hospital would take based on predictions (be specific — who does what, and how many patients can they handle?)
  • At least two success metrics (one offline, one online)
  • At least one feature that would be tempting to include but constitutes data leakage, with an explanation of why

Your document should be approximately 300-400 words. Use the StreamFlow problem framing document from Section 2.2 as a template.


Exercise 2: Target Variable Ambiguity

For each business question below, propose two different target variable definitions and explain how the choice between them would change the model's behavior and usefulness.

(a) "Predict which customers will default on their loan."

(b) "Predict which employees will leave the company."

(c) "Predict which products will go viral on social media."

For each pair, identify which definition you would recommend and why.


Exercise 3: Build the Stupid Baselines

Using the code template below, implement four baseline models for a churn classification problem. Report accuracy, precision, recall, F1, and AUC-ROC for each.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)

# Generate synthetic churn data (8% churn rate)
np.random.seed(42)
n = 50000
X = np.random.randn(n, 10)
y = (np.random.rand(n) < 0.08).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Implement these four baselines:
# 1. Majority class (DummyClassifier with strategy="most_frequent")
# 2. Stratified random (DummyClassifier with strategy="stratified")
# 3. Uniform random (DummyClassifier with strategy="uniform")
# 4. Prior probability (DummyClassifier with strategy="prior")
#
# YOUR CODE HERE

Questions to answer:

  • Which baseline has the highest accuracy? Why is this misleading?
  • Which metric best reveals how useless the majority-class baseline is?
  • What AUC-ROC value should every real model beat?


Exercise 4: Spot the Leakage

Each scenario below contains data leakage. For each one:

  1. Identify the type of leakage (target, temporal, or train/test contamination)
  2. Explain why it is leakage (what information is unavailable at prediction time?)
  3. Describe the fix

(a) A model predicts whether a flight will be delayed. One of the features is actual_arrival_time.

(b) A team builds a churn model. They normalize all features using the mean and standard deviation computed across the entire dataset, then perform a train/test split.

(c) A model predicts whether a student will pass a final exam. Features include midterm_grade, homework_average, and final_exam_study_hours. The dataset is constructed by surveying students after the semester ends.

(d) A recommendation model is trained on all user-product interactions from 2023. It is evaluated on a random 20% holdout from the same data. In production (January 2024), it performs significantly worse.

(e) A fraud detection model includes transaction_is_disputed as a feature. Disputes are filed by customers after they notice a fraudulent charge — usually 1-5 days after the transaction.

(f) A model predicts employee attrition. The team uses SMOTE to oversample the minority class (employees who left), then performs a train/test split. The model achieves AUC-ROC of 0.91, but performs at 0.72 in production.

(g) A model predicts loan default. One feature is credit_score_at_collection, which is the borrower's credit score retrieved by the collections team after a payment was missed.


Exercise 5: Temporal vs. Random Split

You have a dataset spanning January 2023 through December 2023. Compare the following two splitting strategies:

Strategy A: Random 80/20 split, stratified by the target.

Strategy B: Temporal split — train on Jan-Sep, validate on Oct, test on Nov-Dec.

Write code to implement both strategies and answer:

  1. Why does Strategy B better simulate production conditions?
  2. Under what circumstances might Strategy A actually be acceptable?
  3. If Strategy B shows significantly worse performance than Strategy A, what does that tell you about your data?

import pandas as pd
import numpy as np

# Create a synthetic dataset with temporal structure
np.random.seed(42)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
n_per_day = 100

rows = []
for date in dates:
    for _ in range(n_per_day):
        # Churn rate increases over time (concept drift)
        base_churn_rate = 0.05 + 0.005 * (date.month - 1)
        rows.append({
            "date": date,
            "feature_1": np.random.randn(),
            "feature_2": np.random.randn(),
            "churned": int(np.random.rand() < base_churn_rate),
        })

df = pd.DataFrame(rows)
print(f"Dataset: {len(df):,} rows, {df['date'].min()} to {df['date'].max()}")
print("Monthly churn rates:")
print(df.groupby(df["date"].dt.month)["churned"].mean().round(3))

# Implement Strategy A and Strategy B here
# Compare AUC-ROC for a logistic regression model under each strategy
# YOUR CODE HERE

Exercise 6: The Metric Hierarchy

You are the lead data scientist at StreamFlow. The VP of Customer Success, the CFO, and the Chief Product Officer each want a different success metric for the churn model:

  • VP of Customer Success: "Maximize recall. I want to catch every subscriber who might churn."
  • CFO: "Maximize precision. Every retention offer costs us money. I only want to target subscribers who are truly going to churn."
  • CPO: "Maximize the model's AUC-ROC. That is the most comprehensive metric."

Write a one-page memo (in markdown) that:

  1. Explains why each stakeholder's preferred metric is reasonable from their perspective
  2. Explains the tradeoff between them using the precision-recall tradeoff
  3. Proposes a metric hierarchy with primary, secondary, and guardrail metrics, with justification for your choices
  4. Gives a concrete numeric example: assume the model can operate at (a) 90% recall, 15% precision or (b) 40% recall, 60% precision. Calculate the business impact of each scenario given StreamFlow's numbers: 196,800 monthly churners, 15,000 contact capacity, $12.50 monthly revenue per subscriber, 30% retention success rate with discount offer
  5. Recommends which operating point you would choose and why
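
For item 4, the arithmetic can be structured like this (a sketch that, for simplicity, ignores the discount cost and assumes each operating point's precision holds among the 15,000 subscribers actually contacted):

```python
capacity = 15_000            # retention team contact capacity per month
monthly_revenue = 12.50      # revenue per subscriber per month
retention_success = 0.30     # share of truly-churning contacts who stay after the offer

def monthly_revenue_saved(precision):
    true_churners_contacted = capacity * precision
    subscribers_retained = true_churners_contacted * retention_success
    return subscribers_retained * monthly_revenue

for label, precision in [("(a) 90% recall, 15% precision", 0.15),
                         ("(b) 40% recall, 60% precision", 0.60)]:
    print(f"{label}: ${monthly_revenue_saved(precision):,.2f} saved per month")
```

Under this simplification, recall drops out of the result once the team is contact-limited, which is itself a point worth making in the memo.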

Exercise 7: Workflow Redesign

A junior data scientist on your team presents this workflow for the StreamFlow churn project:

1. Query all data from the data warehouse into one big DataFrame
2. Drop all columns with any missing values
3. One-hot encode all categorical variables
4. Normalize all features using StandardScaler
5. Split into 80% train, 20% test
6. Train XGBoost with default parameters
7. Report accuracy on the test set
8. Deploy to production

Identify at least six problems with this workflow. For each problem:

  1. Explain what could go wrong in practice
  2. Describe the correct approach
  3. Rank the severity as critical (model will fail in production), moderate (model will underperform), or minor (bad practice but may not cause immediate failure)

Rewrite the workflow as a corrected 12-15 step process.


Exercise 8: Cross-Validation Design

Implement 5-fold stratified cross-validation for the StreamFlow churn problem. Then implement time-series cross-validation with expanding windows. Compare the two approaches.

from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score
import numpy as np

# Use the synthetic dataset from Exercise 5 (or generate new data)
# Implement both CV strategies
# Report mean and std of AUC-ROC for each
# YOUR CODE HERE

Questions:

  1. Which CV strategy gives higher AUC-ROC estimates? Why?
  2. Which estimate is more honest about expected production performance?
  3. Why must time-series CV use expanding (or sliding) windows rather than random folds?


Exercise 9: Feature Availability Audit

Below is a list of potential features for the StreamFlow churn model. For each, classify it as:

  • Safe: Available at prediction time, no leakage risk
  • Risky: Potentially available but could introduce leakage if computed incorrectly (explain the specific risk)
  • Excluded: Not available at prediction time or directly encodes the target (explain why)

Explain your reasoning for each feature in 1-2 sentences. Remember: the prediction is made on the first day of the month, covering the next 30 days.

| Feature | Classification | Reasoning |
| --- | --- | --- |
| tenure_months | | |
| hours_watched_last_30d | | |
| cancellation_reason | | |
| plan_type | | |
| days_since_last_login | | |
| total_hours_watched_all_time | | |
| retention_offer_accepted | | |
| support_tickets_this_month | | |
| content_library_size | | |
| competitor_signup | | |
| next_month_hours_watched | | |
| payment_failed_this_billing_cycle | | |
| avg_session_duration_last_90d | | |
| num_plan_downgrades_last_12m | | |
| subscriber_satisfaction_survey | | |

After completing the audit, identify which "Risky" features you would include with safeguards vs. exclude entirely. What safeguards would you implement?


Exercise 10: The Business Value Question

StreamFlow's churn rate is 8.2% (approximately 196,800 subscribers churn per month from 2.4 million). The average monthly revenue per subscriber is $12.50. The retention team can contact 15,000 subscribers per month with a 20% discount offer (reducing revenue from $12.50 to $10.00 for 3 months). Historical data suggests that 30% of subscribers who receive a retention offer and were truly going to churn will stay.

Calculate:

(a) Monthly revenue lost to churn (without any intervention).

(b) If the model perfectly identifies the top 15,000 churners (100% precision among contacted), how much monthly revenue is recovered? Account for the discount cost.

(c) If the model has 40% precision among the top 15,000 flagged subscribers, how much monthly revenue is recovered? How does this compare to random selection?

(d) At what precision level does the model-guided intervention break even compared to no intervention?

Show your work with Python code.

# Starter code for Exercise 10
monthly_subscribers = 2_400_000
churn_rate = 0.082
avg_monthly_revenue = 12.50
retention_capacity = 15_000
discount_rate = 0.20
discount_duration_months = 3
retention_success_rate = 0.30  # 30% of truly-churning subscribers stay if offered discount

# (a) Monthly revenue lost to churn (no intervention)
monthly_churners = int(monthly_subscribers * churn_rate)
monthly_revenue_lost = monthly_churners * avg_monthly_revenue
print(f"(a) Monthly churners: {monthly_churners:,}")
print(f"    Monthly revenue lost: ${monthly_revenue_lost:,.0f}")

# (b) Perfect precision model — YOUR CODE HERE

# (c) 40% precision model — YOUR CODE HERE
# Compare to random selection: what precision would random get?

# (d) Break-even precision — YOUR CODE HERE

Exercise 11: End-to-End Workflow Design

You are tasked with building an ML system for a new domain. Choose one of the following:

  • Loan default prediction for a credit union with 50,000 members
  • Equipment failure prediction for a fleet of 500 delivery trucks
  • Student dropout prediction for an online learning platform with 200,000 learners

Write a complete workflow plan that covers all eight stages from Section 2.1. For each stage, provide:

  1. The specific decisions you would make
  2. At least one risk or potential failure mode
  3. How you would mitigate that risk

Your plan should be 800-1,200 words. Be specific — do not just restate the general framework. Apply it to the domain you chose.


Exercise 12: Monitoring Dashboard Design

Design a monitoring dashboard for the deployed StreamFlow churn model. Your dashboard should include:

  1. Four metrics to display, with refresh frequency for each (daily? weekly? monthly?). Justify why each metric needs its specific frequency.
  2. Three automated alerts with specific trigger conditions (include numeric thresholds, e.g., "alert if KS statistic for hours_watched_last_30d exceeds 0.15")
  3. Two visualizations: one technical (for the data science team) and one non-technical (for the VP of Customer Success). Describe what each shows and why it matters.
  4. The retraining trigger: under what specific conditions would you retrain the model? Define both "automatic retraining" conditions and "manual investigation" conditions.

Sketch the dashboard layout (ASCII art or structured description). Include a sample alert message that would be sent to the team's Slack channel when one of your alerts fires.
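
As a hint for the drift alert, a two-sample Kolmogorov-Smirnov check can be sketched with scipy (synthetic data here; in production, reference would hold the feature's training-time values and current the most recent window):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.exponential(20, 10_000)  # hours_watched_last_30d at training time
current = rng.exponential(12, 10_000)    # recent production window (shifted down)

ks_stat, p_value = ks_2samp(reference, current)
print(f"KS statistic: {ks_stat:.3f}")
if ks_stat > 0.15:
    print(f"ALERT: hours_watched_last_30d drift detected (KS={ks_stat:.3f} > 0.15)")
```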


Exercise 13: The Leakage Investigation (Challenge)

The following code trains a model and achieves suspiciously high AUC-ROC. There are three sources of leakage. Find all three.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

np.random.seed(42)
n = 10000
df = pd.DataFrame({
    "tenure_months": np.random.exponential(12, n),
    "hours_watched_30d": np.random.exponential(20, n),
    "support_tickets_90d": np.random.poisson(1.5, n),
    "plan_type": np.random.choice(["Basic", "Standard", "Premium"], n),
})

# Target: churn
df["churned"] = (
    (df["hours_watched_30d"] < 5) & (df["support_tickets_90d"] > 2)
).astype(int)

# Feature that only exists after churn
df["cancellation_survey_score"] = np.where(
    df["churned"] == 1,
    np.random.randint(1, 4, n),
    np.nan,
)

# Feature computed on future data
df["hours_watched_next_month"] = df["hours_watched_30d"] * np.random.uniform(0.5, 1.5, n)
df.loc[df["churned"] == 1, "hours_watched_next_month"] = (
    df.loc[df["churned"] == 1, "hours_watched_30d"] * 0.1
)

# Preprocessing: fit scaler on ALL data before splitting
features = [
    "tenure_months",
    "hours_watched_30d",
    "support_tickets_90d",
    "cancellation_survey_score",
    "hours_watched_next_month",
]
df[features] = df[features].fillna(-1)

scaler = StandardScaler()
df[features] = scaler.fit_transform(df[features])

# Now split
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.3f}")

For each source of leakage:

  1. Name the type (target, temporal, or train/test contamination)
  2. Identify the specific line(s) of code
  3. Explain why it is leakage
  4. Show how to fix it