Exercises: Chapter 1
From Analysis to Prediction
Exercise 1: Classify the Question (Conceptual)
For each of the following questions, classify it as descriptive, inferential, or predictive. Briefly explain your reasoning.
a) What was the average order value for customers who used a coupon code in Q4 2024?
b) Does offering free shipping increase conversion rate for first-time buyers?
c) Which of our current users are most likely to upgrade from the free tier to the paid tier in the next 60 days?
d) How many support tickets did we receive last month, broken down by category?
e) Is there a statistically significant relationship between employee tenure and promotion rate, controlling for department and performance rating?
f) Will this loan applicant default within 12 months?
g) What percentage of patients in the clinical trial experienced adverse events, stratified by dosage group?
h) Which wind turbines in our fleet will require bearing replacement in the next 90 days?
Exercise 2: Frame the ML Problem (Applied)
You are a data scientist at a mid-size e-commerce company. The Head of Marketing says: "We spend $3M/month on paid advertising. I think we're wasting money on customers who would have bought anyway. Can you help?"
Complete the framing checklist:
- What is the business question? (Restate in precise terms.)
- What is the prediction target? (Define it precisely --- binary, continuous, or categorical?)
- What is the observation unit?
- What is the prediction horizon?
- What features would be available at prediction time?
- How will the prediction be used? (What specific action will change based on the model output?)
- What is the "stupid baseline" for this problem?
- What is one feature that might seem useful but would actually be target leakage?
Exercise 3: Identify the Problem Type (Conceptual)
For each scenario, identify whether it is best framed as classification, regression, clustering, ranking, or anomaly detection. Some scenarios may reasonably support multiple framings --- if so, state your preferred framing and explain why.
a) A streaming service wants to sort its content library so that each user sees the most relevant titles first on their homepage.
b) A manufacturing plant wants to detect when a production line is behaving abnormally before defective products reach quality control.
c) A real estate company wants to estimate the sale price of a property based on its features (square footage, location, bedrooms, etc.).
d) A marketing team wants to group their customer base into segments for targeted email campaigns, but they have no pre-defined segment labels.
e) A hospital wants to determine whether a patient presenting in the emergency department should be admitted, sent to observation, or discharged.
f) An insurance company wants to estimate the dollar amount of a claim given the claim details.
g) A cybersecurity team wants to flag network traffic patterns that deviate from normal behavior.
Exercise 4: The Bias-Variance Tradeoff (Conceptual)
A junior data scientist shows you the following results for three models trained on the same dataset:
| Model | Training Accuracy | Test Accuracy |
|---|---|---|
| Logistic Regression | 78.2% | 76.8% |
| Random Forest (500 trees, max_depth=None) | 99.7% | 81.3% |
| Random Forest (100 trees, max_depth=5) | 85.4% | 83.1% |
a) Which model shows the strongest signs of overfitting? How can you tell?
b) Which model shows the strongest signs of underfitting? How can you tell?
c) Which model would you recommend for production deployment, and why?
d) The junior data scientist says: "The second model has the largest gap between training and test accuracy, but its test accuracy is still higher than logistic regression's, so it's fine." What is wrong with this reasoning?
Exercise 5: Target Leakage Detective (Applied)
You are building a model to predict whether a customer will return an online purchase within 30 days of delivery. Your colleague has prepared the following feature set:
```python
features = [
    'product_category',
    'product_price',
    'customer_tenure_days',
    'customer_prior_returns',
    'delivery_time_days',
    'return_shipping_label_generated',        # 1 if a return label was created
    'customer_satisfaction_survey_score',     # Post-delivery survey (1-5)
    'product_rating_at_purchase',
    'payment_method',
    'order_day_of_week',
    'customer_contacted_support_about_order',
]
```
a) Identify the feature(s) that constitute target leakage. Explain why.
b) Identify any feature(s) that are borderline --- not strictly leakage but potentially problematic. Explain the concern.
c) Suggest two additional features that would be predictive and available at the right time.
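A practical smoke test for leakage, sketched below on a hypothetical toy dataset (not the feature set above): fit a model on each candidate feature alone. A near-perfect single-feature score is a strong hint that the feature encodes the outcome itself. The feature names here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: 'label_generated' is a post-outcome artifact (leaky),
# 'prior_returns' is a legitimate signal known before the outcome.
rng = np.random.default_rng(0)
n = 2000
returned = rng.binomial(1, 0.2, n)                       # target
label_generated = returned & rng.binomial(1, 0.95, n)    # nearly identical to target
prior_returns = rng.poisson(1 + returned, n)             # weakly related

aucs = {}
for name, feat in [('label_generated', label_generated),
                   ('prior_returns', prior_returns)]:
    aucs[name] = cross_val_score(LogisticRegression(), feat.reshape(-1, 1),
                                 returned, scoring='roc_auc', cv=5).mean()
    print(f"{name}: single-feature AUC = {aucs[name]:.3f}")
```

A single feature scoring close to 1.0 on its own, as `label_generated` does here, deserves scrutiny: legitimate features rarely carry that much signal by themselves.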
Exercise 6: Business Case Math (Applied)
TurbineTech operates 1,200 wind turbines. On average, 3% of turbines experience an unplanned failure each month. The cost breakdown:
- Unplanned failure: $48,000/day downtime (average 7 days) + $120,000 emergency repair = $456,000 total
- Planned maintenance (if failure predicted in advance): $8,000 + 1 day of planned downtime ($12,000 in low-wind period scheduling) = $20,000 total
Suppose TurbineTech builds a model with the following performance:
- Recall: 70% (catches 70% of actual failures)
- False positive rate: 5% (5% of healthy turbines are incorrectly flagged)
Calculate:
a) How many turbines fail per month on average?
b) How many failures does the model catch? How many does it miss?
c) How many false alarms does the model generate?
d) What is the monthly cost savings from deploying this model? (Account for caught failures, missed failures, false alarms, and the cost of unnecessary planned maintenance on false positives.)
e) At what recall rate does the model break even (monthly savings = $0), assuming the false positive rate stays at 5%?
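If you want to check your arithmetic, here is a sketch that sets up the quantities for parts (a) through (c); the cost calculations in (d) and (e) are left to you. All figures come from the exercise statement.

```python
n_turbines = 1200
failure_rate = 0.03   # monthly unplanned failure rate
recall = 0.70         # share of actual failures the model catches
fpr = 0.05            # share of healthy turbines falsely flagged

failures = n_turbines * failure_rate          # part (a)
caught = failures * recall                    # part (b)
missed = failures - caught
false_alarms = (n_turbines - failures) * fpr  # part (c)

print(f"failures/month: {failures:.0f}")
print(f"caught: {caught:.1f}, missed: {missed:.1f}")
print(f"false alarms: {false_alarms:.1f}")
```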
Exercise 7: Explore a Churn-Like Dataset (Coding)
Use the Telco Customer Churn dataset (available on Kaggle and other public mirrors; it is not bundled with scikit-learn). If you cannot access it, generate a synthetic one using the code below.
```python
import pandas as pd
import numpy as np

np.random.seed(42)
n = 5000

df = pd.DataFrame({
    'tenure_months': np.random.exponential(24, n).astype(int).clip(1, 72),
    'monthly_charges': np.round(np.random.uniform(20, 110, n), 2),
    'total_charges': np.nan,  # Will compute
    'contract_type': np.random.choice(['month-to-month', 'one-year', 'two-year'], n, p=[0.5, 0.3, 0.2]),
    'num_support_tickets': np.random.poisson(2, n),
    'num_devices': np.random.choice([1, 2, 3, 4], n, p=[0.4, 0.3, 0.2, 0.1]),
    'online_backup': np.random.choice([0, 1], n, p=[0.4, 0.6]),
})
df['total_charges'] = df['tenure_months'] * df['monthly_charges'] * np.random.uniform(0.9, 1.1, n)

# Generate churn label with realistic relationships
churn_logit = (
    -2.0
    + 0.8 * (df['contract_type'] == 'month-to-month').astype(int)
    - 0.03 * df['tenure_months']
    + 0.01 * df['monthly_charges']
    + 0.15 * df['num_support_tickets']
    - 0.2 * df['num_devices']
    + np.random.normal(0, 0.5, n)
)
df['churned'] = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.1%}")
print(df.head())
```
Complete the following tasks:
a) Compute the overall churn rate. What would the accuracy be if you predicted "no churn" for every customer?
b) Create a crosstab of churn rate by contract_type. Which contract type has the highest churn rate? Does this make business sense?
c) Plot the distribution of tenure_months for churned vs. retained customers (overlaid histograms or KDE plots). What do you observe?
d) Calculate the correlation between each numeric feature and the churned target. Which feature has the strongest linear association with churn?
e) Split the data into train (80%) and test (20%) sets using train_test_split with random_state=42 and stratify=y. Print the churn rate in both sets to verify stratification worked.
```python
from sklearn.model_selection import train_test_split

X = df.drop('churned', axis=1)
y = df['churned']
# Your code here
```
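If you are unsure what "verifying stratification" should look like, here is the pattern on a small toy label array (the variable names are illustrative; adapt the idea to df and y above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy illustration of stratify=: class proportions are preserved in both splits.
rng = np.random.default_rng(42)
y_toy = rng.binomial(1, 0.3, 1000)   # ~30% positive class
X_toy = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(f"train rate: {y_tr.mean():.3f}, test rate: {y_te.mean():.3f}")
# The two rates should match to within rounding; without stratify= they can drift.
```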
Exercise 8: The Wrong Question (Conceptual)
Read the following scenario and identify what is wrong with the ML framing.
A hospital wants to reduce emergency department (ED) wait times. The analytics team proposes building a model to predict how long each patient will wait in the ED, using features like time of arrival, day of week, chief complaint, and current ED census. The model would display predicted wait times on a screen in the waiting room.
Write a paragraph explaining:
a) Why this framing is problematic.
b) What business question the hospital is actually trying to answer.
c) How you would reframe the ML problem to address the real need.
Exercise 9: R-Squared Is Not Enough (Coding)
Run the following code and answer the questions below.
```python
import numpy as np
import pandas as pd  # needed for the results table below
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 200
X = np.random.uniform(0, 10, n).reshape(-1, 1)
y = 3 * np.sin(X.ravel()) + 0.5 * X.ravel() + np.random.normal(0, 1.5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = []
for degree in [1, 2, 3, 5, 10, 20]:
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    r2_train = r2_score(y_train, model.predict(X_train_poly))
    r2_test = r2_score(y_test, model.predict(X_test_poly))
    mae_test = mean_absolute_error(y_test, model.predict(X_test_poly))

    results.append({
        'degree': degree,
        'r2_train': round(r2_train, 4),
        'r2_test': round(r2_test, 4),
        'mae_test': round(mae_test, 4),
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
```
a) At what polynomial degree does training R-squared first exceed 0.95?
b) At what polynomial degree does test R-squared peak?
c) What happens to test R-squared at degree 20? Why?
d) If you reported only training R-squared, which degree would you choose? If you reported test R-squared, which would you choose? Explain the discrepancy.
Exercise 10: Problem Framing Practice (Applied)
For each of the following business scenarios, write a one-paragraph ML problem framing that specifies: (1) the prediction target, (2) the observation unit, (3) the prediction horizon, (4) one likely useful feature, and (5) one feature that would be target leakage.
a) A ride-sharing company wants to anticipate driver shortages in specific neighborhoods during peak hours.
b) A content moderation team wants to automatically flag user-generated comments that violate community guidelines.
c) An agricultural company wants to predict crop yield for individual fields to optimize fertilizer distribution.
d) A university admissions office wants to predict which admitted students will actually enroll (yield prediction).
Exercise 11: Framing Debate (Discussion)
A large retail bank wants to "predict fraud." Two data scientists propose different framings:
- Data Scientist A proposes: Binary classification. Target = is_fraudulent_transaction (1/0). Observation unit = individual transaction. Predict at the time of transaction authorization. Action: block the transaction if fraud probability exceeds a threshold.
- Data Scientist B proposes: Anomaly detection. No explicit target variable. The model learns the "normal" transaction patterns for each customer. Observation unit = individual transaction. Flag transactions that deviate significantly from the customer's profile. Action: send a verification SMS before authorizing.
Which framing would you choose, and under what circumstances? Is one strictly better than the other? What data requirements differ between the two approaches?
Challenge Exercise: The Meta-Problem (Synthesis)
You are the first data scientist hired at a 200-person company that sells B2B project management software. The CEO hands you a list of ten "problems to solve with data science." You have time and resources to tackle three in the first year.
The list:
1. Predict which trial users will convert to paid
2. Forecast next quarter's revenue
3. Identify product features most correlated with retention
4. Detect anomalous usage patterns that indicate security breaches
5. Recommend training content to new users based on their role
6. Predict which enterprise deals will close this quarter
7. Segment customers for the marketing team
8. Optimize the pricing page layout via A/B testing
9. Predict customer support ticket volume for staffing
10. Identify at-risk accounts for the customer success team
For each item:
a) Classify it (descriptive, inferential, or predictive)
b) Identify the ML problem type (classification, regression, clustering, ranking, recommendation, anomaly detection, or "not ML")
c) Estimate the business impact (high/medium/low) and technical difficulty (high/medium/low)
Then: choose the three you would tackle first and write a one-paragraph justification for your prioritization. What criteria did you use?
These exercises support Chapter 1: From Analysis to Prediction. Return to the chapter for reference.