Exercises: Chapter 35
Capstone --- End-to-End ML System
Exercise 1: Architecture Diagram Review (Conceptual)
The chapter presented a system architecture with nine components. Three of those components have a direct dependency on labeled production data (data that can only be obtained after the prediction window closes).
a) Identify the three components. For each, explain why labeled data is required and how long the label delay is for the StreamFlow churn system (60-day prediction window).
b) A product manager asks: "Why can't we just evaluate the model in real-time, like we monitor API latency?" Write a 3--4 sentence response that explains the label delay problem in non-technical terms.
c) Propose one proxy metric that could be computed in real-time (without waiting for labels) that would give an early warning of model degradation. Explain what it measures, why it is a useful proxy, and what its limitations are.
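One concrete candidate for part (c) is the distribution of the model's own predicted probabilities: it needs no labels, and a shift in the score distribution is an early signal that the inputs (and possibly performance) have changed. A minimal sketch, generic and not tied to the chapter's StreamFlow artifacts:

```python
import numpy as np

def score_psi(ref_scores, live_scores, n_bins=10):
    """PSI between reference and live predicted-probability distributions.

    Bins are quantiles of the reference scores, so each reference bin holds
    roughly 1/n_bins of the deployment-time predictions.
    """
    edges = np.quantile(ref_scores, np.linspace(0, 1, n_bins + 1))
    edges[0] -= 1e-9                              # close the first bin on the left
    live = np.clip(live_scores, edges[0], edges[-1])
    ref_frac = np.histogram(ref_scores, bins=edges)[0] / len(ref_scores)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    eps = 1e-6                                    # guard against log(0) in empty bins
    ref_frac = np.clip(ref_frac, eps, None)
    live_frac = np.clip(live_frac, eps, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
baseline = rng.beta(2, 6, 5000)   # score distribution at deployment time
shifted = rng.beta(2, 4, 5000)    # scores after the input mix changes
print(f"no drift: {score_psi(baseline, baseline[:2500]):.3f}")
print(f"drifted:  {score_psi(baseline, shifted):.3f}")
```

The limitation to note in your answer: score drift tells you *something* changed, not whether accuracy actually dropped, and a model can degrade without its score distribution moving.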
Exercise 2: Pipeline Consistency Check (Code)
A common production bug is train-serve skew: the preprocessing applied during training differs from the preprocessing applied during inference. Write a test verifying that the training path and the serving path produce identical output for the same input.
import numpy as np
import pandas as pd
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
np.random.seed(42)
# Simulate a single subscriber's features
sample_subscriber = {
    "tenure_days": 365.0,
    "monthly_charges": 14.99,
    "sessions_last_30d": 18,
    "avg_session_minutes": 42.5,
    "weekend_ratio": 0.33,
    "unique_content_last_30d": 12,
    "days_since_last_activity": 3.0,
    "support_tickets_last_90d": 1,
    "plan_type": "standard",
    "payment_method": "credit_card",
    "contract_type": "month_to_month",
    "has_partner": "yes",
    "has_dependents": "no",
    "age_bucket": "25-34",
}
# a) Write a function that takes a dictionary of features and applies the
# same engineered feature creation from the chapter (engagement_intensity,
# content_diversity_ratio, recency_score, support_burden, charges_per_session).
# Return a DataFrame ready for the preprocessor.
# b) Load the preprocessor (or recreate it with the same configuration).
# Transform the sample subscriber using the preprocessor.
# Save the output as "expected_output".
# c) Now simulate the API path: create the DataFrame from the dictionary,
# apply feature engineering, apply the preprocessor.
# Save the output as "api_output".
# d) Assert that expected_output and api_output are identical
# (within floating-point tolerance).
# np.testing.assert_allclose(expected_output, api_output, atol=1e-10)
# e) Introduce a deliberate bug: swap the order of two features in the API path.
# Does the assertion catch it? Why or why not?
# Your code here
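The overall shape of the test can be sketched without the chapter's saved artifacts. The feature set, derived column, and preprocessor below are simplified stand-ins (one engineered feature instead of five, a locally fitted preprocessor instead of the chapter's `joblib` artifact); the point is the two-path comparison in parts (b)-(d):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["tenure_days", "monthly_charges", "sessions_last_30d", "charges_per_session"]
CATEGORICAL = ["plan_type", "contract_type"]

def engineer(raw: dict) -> pd.DataFrame:
    """Stand-in for the chapter's feature engineering (one derived feature only)."""
    df = pd.DataFrame([raw])
    df["charges_per_session"] = df["monthly_charges"] / df["sessions_last_30d"].clip(lower=1)
    return df

# Fit a preprocessor on synthetic training data so the sketch is self-contained.
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "tenure_days": rng.uniform(10, 1000, 200),
    "monthly_charges": rng.choice([9.99, 14.99, 24.99], 200),
    "sessions_last_30d": rng.poisson(14, 200),
    "plan_type": rng.choice(["basic", "standard", "premium"], 200),
    "contract_type": rng.choice(["month_to_month", "annual"], 200),
})
train["charges_per_session"] = train["monthly_charges"] / train["sessions_last_30d"].clip(lower=1)

pre = ColumnTransformer(
    [("num", StandardScaler(), NUMERIC),
     ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
    sparse_threshold=0.0,  # always return a dense array so allclose works
).fit(train)

sample = {"tenure_days": 365.0, "monthly_charges": 14.99, "sessions_last_30d": 18,
          "plan_type": "standard", "contract_type": "month_to_month"}

# Training path: batch DataFrame with the derived column computed by batch code.
batch = pd.DataFrame([sample])
batch["charges_per_session"] = batch["monthly_charges"] / batch["sessions_last_30d"].clip(lower=1)
expected_output = pre.transform(batch)

# API path: dict -> feature engineering -> the same fitted preprocessor.
api_output = pre.transform(engineer(sample))

np.testing.assert_allclose(expected_output, api_output, atol=1e-10)
print("train and serve paths agree")
```

A hint for part (e): because `ColumnTransformer` selects columns by *name*, reordering keys in the input dict will not change the output, so the assertion will not catch that "bug". The swap only bites when positional arrays, rather than DataFrames, cross the train/serve boundary.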
Exercise 3: Monitoring Simulation (Code)
Simulate a production scenario where the model encounters data drift and performance decay over six months.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
np.random.seed(42)
# Training distribution (reference)
n_train = 10000
train_data = pd.DataFrame({
    "sessions_last_30d": np.random.poisson(14, n_train),
    "avg_session_minutes": np.random.normal(35, 10, n_train).clip(0),
    "days_since_last_activity": np.random.exponential(5, n_train),
    "monthly_charges": np.random.choice([9.99, 14.99, 24.99], n_train, p=[0.35, 0.35, 0.30]),
    "support_tickets_last_90d": np.random.poisson(0.8, n_train),
})
# Simulate six months of production data.
# Month 1-2: no drift (same distribution as training)
# Month 3-4: moderate drift (sessions increase due to new feature launch)
# Month 5-6: severe drift (sessions spike + support tickets double due to outage)
# a) For each month, generate 5,000 production observations with the
# appropriate distributional shifts. Store as a list of DataFrames.
# b) Compute PSI for each feature for each month, using the training data
# as the reference distribution. Use the compute_psi function from the chapter
# or implement your own.
# c) Create a table showing PSI values over time:
# | Month | sessions_last_30d | avg_session_minutes | days_since_last_activity | ... |
# Highlight any cells where PSI > 0.10 (warning) or > 0.25 (alert).
# d) For months 5-6, would the drift in support_tickets_last_90d be detected
# by PSI? Why might PSI underperform for count features with low cardinality?
# e) Propose an alternative drift detection method for low-cardinality count
# features. Implement it and show that it detects the drift in month 5.
# Your code here
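If you implement your own PSI for part (b), one common version uses quantile bins on the reference data. The chapter's `compute_psi` may differ in its binning details; this sketch is only meant to make the exercise self-contained, and it also shows the part (d) issue, namely that quantile edges collapse to just a few bins for low-cardinality counts:

```python
import numpy as np

def compute_psi(reference, production, n_bins=10):
    """PSI with quantile bins on the reference distribution.

    For discrete counts, many quantiles coincide, so np.unique collapses the
    intended n_bins edges into far fewer bins (the part (d) caveat).
    """
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, n_bins + 1)))
    edges[0] -= 1e-9                              # close the first bin on the left
    prod = np.clip(production, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_frac = np.histogram(prod, bins=edges)[0] / len(prod)
    eps = 1e-6                                    # guard against log(0)
    ref_frac = np.clip(ref_frac, eps, None)
    prod_frac = np.clip(prod_frac, eps, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))

rng = np.random.default_rng(42)
ref_counts = rng.poisson(0.8, 10_000)   # support tickets: mostly 0s and 1s
same = rng.poisson(0.8, 5_000)          # months 1-2: no drift
doubled = rng.poisson(1.6, 5_000)       # months 5-6: outage doubles the rate
print(f"no drift: {compute_psi(ref_counts, same):.3f}")
print(f"doubled:  {compute_psi(ref_counts, doubled):.3f}")
```

A doubling of the rate is large enough to register even with the collapsed bins; subtler shifts in count features can hide inside a merged bin, which motivates the alternative method asked for in part (e), such as a chi-square test on the exact count values.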
Exercise 4: Fairness Under Threshold Changes (Code)
The StreamFlow churn model uses a threshold of 0.20. Investigate how the threshold affects fairness across age groups.
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score
np.random.seed(42)
# Simulate predictions for three age groups
n = 3000
age_groups = np.random.choice(["18-24", "25-44", "45+"], n, p=[0.25, 0.50, 0.25])
# Simulate model probabilities (the model is slightly better calibrated for 25-44)
y_true = np.random.binomial(1, 0.12, n)
noise = np.where(age_groups == "25-44", 0.0, 0.03) # extra noise for edge groups
y_proba = np.where(
    y_true == 1,
    np.random.beta(4, 2, n) + np.random.normal(0, noise),
    np.random.beta(1.5, 5, n) + np.random.normal(0, noise),
).clip(0, 1)
# a) For thresholds from 0.10 to 0.40 (in steps of 0.05), compute precision,
# recall, and positive rate (fraction flagged) for each age group.
# b) Create a table showing how the demographic parity ratio (positive rate
# for group / overall positive rate) changes with the threshold.
# c) At what threshold does the worst-case demographic parity ratio drop
# below 0.80? Is this the same threshold that maximizes overall recall?
# d) Write a 2-3 sentence recommendation for the customer success VP:
# "Given the fairness constraints, the recommended threshold is X because..."
# Your code here
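A building block for parts (a) and (b) is a per-group table at a single threshold; the loop over thresholds is then a few more lines. This sketch regenerates the simulation with `np.random.default_rng` rather than the legacy seeding above, so its exact numbers will differ from yours:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 3000
age_groups = rng.choice(["18-24", "25-44", "45+"], n, p=[0.25, 0.50, 0.25])
y_true = rng.binomial(1, 0.12, n)
noise = np.where(age_groups == "25-44", 0.0, 0.03)   # extra noise for edge groups
y_proba = np.where(
    y_true == 1,
    rng.beta(4, 2, n) + rng.normal(0, 1, n) * noise,
    rng.beta(1.5, 5, n) + rng.normal(0, 1, n) * noise,
).clip(0, 1)

def parity_table(y_proba, age_groups, threshold):
    """Positive rate and demographic parity ratio per group at one threshold."""
    flagged = y_proba >= threshold
    overall = flagged.mean()
    rows = [{"group": g,
             "positive_rate": flagged[age_groups == g].mean(),
             "parity_ratio": flagged[age_groups == g].mean() / overall}
            for g in np.unique(age_groups)]
    return pd.DataFrame(rows)

print(parity_table(y_proba, age_groups, threshold=0.20))
```

For part (c), sweep the threshold, take the minimum `parity_ratio` at each step, and find where that minimum crosses 0.80.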
Exercise 5: Stakeholder Presentation (Writing + Code)
You have completed the full StreamFlow capstone (Track B). The VP of Customer Success has asked for a 15-minute presentation at the quarterly business review. The audience includes the CFO, the CTO, and two product managers.
a) Write slide titles (not content, just titles) for a 10-slide deck. Each title should state a takeaway, not a topic. (Recall from Chapter 34: "The model catches 72% of churners" is a takeaway. "Model Performance" is a topic.)
b) The CFO asks: "What happens if the churn rate doubles next quarter because of the price increase?" Using the compute_monthly_roi function from the chapter, compute the ROI under two scenarios:
import numpy as np
np.random.seed(42)
# Scenario 1: Current churn rate (12%)
# Scenario 2: Doubled churn rate (24%)
# Assume all other parameters remain the same.
# Compute and compare the monthly ROI for both scenarios.
# Your code here
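If you do not have the chapter's `compute_monthly_roi` at hand, the comparison can still be sketched with a hypothetical stand-in. Everything below, the signature, the cost model, and every parameter value, is an assumption for illustration, not the chapter's actual function:

```python
def compute_monthly_roi(n_subscribers, churn_rate, recall, precision,
                        retained_value, save_rate, contact_cost):
    """Hypothetical stand-in for the chapter's compute_monthly_roi.

    Benefit: churners caught and saved retain some future revenue.
    Cost: every flagged subscriber (true positive or false alarm) is contacted.
    """
    churners = n_subscribers * churn_rate
    caught = churners * recall                  # true positives
    contacted = caught / precision              # true positives + false alarms
    benefit = caught * save_rate * retained_value
    cost = contacted * contact_cost
    return (benefit - cost) / cost

base = dict(n_subscribers=100_000, recall=0.72, precision=0.35,
            retained_value=90.0,                # assumed ~6 months of a $14.99 plan
            save_rate=0.30, contact_cost=2.50)
roi_current = compute_monthly_roi(churn_rate=0.12, **base)
roi_doubled = compute_monthly_roi(churn_rate=0.24, **base)
print(f"ROI at 12% churn: {roi_current:.2f}x, at 24% churn: {roi_doubled:.2f}x")
```

Notice that under this purely linear model the two scenarios give identical ROI, since benefit and cost both double with the churn rate. The point worth making to the CFO is that in practice a higher base rate usually *raises* precision at a fixed threshold, so the economics would, if anything, improve.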
c) The CTO asks: "What is the latency of the /predict endpoint? Can we use it in the checkout flow?" Write a 3--4 sentence response explaining why this API is designed for the customer success use case and what would need to change to support real-time checkout-flow integration.
d) A product manager asks: "Can we use this model to decide which subscribers get a discount?" Write a 3--4 sentence response that addresses both the technical feasibility and the ethical implications (reference Chapter 33).
Exercise 6: Retrospective Practice (Writing)
Choose one of the three alternative capstone datasets (hospital readmission, manufacturing predictive maintenance, or e-commerce conversion). Without building the system, write a pre-mortem: a document that imagines the system has been deployed and has failed.
a) Describe the failure scenario in 3--4 sentences. Be specific: what broke, when, and what was the business impact?
b) Trace the failure back to an architectural decision. Which component failed, and what decision during development made the failure possible?
c) Describe the monitoring signal that should have caught the failure. Was it a drift signal, a performance signal, or a business metric?
d) Write the retrospective entry: "What I would do differently to prevent this failure."
Exercise 7: End-to-End Integration Test (Code)
Write an integration test that validates the full prediction pipeline from raw features to API response. This test should catch the most common production bugs: missing features, type mismatches, and preprocessing errors.
import numpy as np
import pandas as pd
import json
np.random.seed(42)
# Define five test cases that exercise different edge cases
test_cases = [
    {
        "name": "typical_subscriber",
        "features": {
            "tenure_days": 365.0,
            "monthly_charges": 14.99,
            "sessions_last_30d": 18,
            "avg_session_minutes": 42.5,
            "weekend_ratio": 0.33,
            "unique_content_last_30d": 12,
            "days_since_last_activity": 3.0,
            "support_tickets_last_90d": 1,
            "plan_type": "standard",
            "payment_method": "credit_card",
            "contract_type": "month_to_month",
            "has_partner": "yes",
            "has_dependents": "no",
            "age_bucket": "25-34",
        },
        "expected_risk": "low",  # or "medium" or "high"
    },
    {
        "name": "high_risk_inactive",
        "features": {
            "tenure_days": 45.0,
            "monthly_charges": 9.99,
            "sessions_last_30d": 0,
            "avg_session_minutes": 0.0,
            "weekend_ratio": 0.0,
            "unique_content_last_30d": 0,
            "days_since_last_activity": 30.0,
            "support_tickets_last_90d": 3,
            "plan_type": "basic",
            "payment_method": "bank_transfer",
            "contract_type": "month_to_month",
            "has_partner": "no",
            "has_dependents": "no",
            "age_bucket": "18-24",
        },
        "expected_risk": "high",
    },
    # c) Add three more test cases:
    #    - A brand-new subscriber (tenure_days = 7, minimal activity)
    #    - A long-tenure premium subscriber with high engagement
    #    - A subscriber with missing/edge-case values (e.g., weekend_ratio = 0.0,
    #      zero sessions but nonzero avg_session_minutes --- data quality issue)
]
# a) Write a function that takes a test case dictionary, applies feature
# engineering, applies preprocessing, runs the model, and returns the
# prediction response (probability, risk level, top reasons).
# b) For each test case, verify that:
# - The probability is between 0 and 1
# - The risk level is one of ["low", "medium", "high"]
# - The top_reasons list has exactly 3 entries
# - Each SHAP value is a finite number (not NaN, not Inf)
# c) For the "high_risk_inactive" case, verify that the probability is > 0.20
# and the risk level is "high". For the "long_tenure_premium" case, verify
# that the probability is < 0.10.
# d) Run all test cases and print a pass/fail summary.
# Your code here
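The harness for parts (a), (b), and (d) can be shaped before the real model exists by scoring against a stub. The `predict_stub` below is a placeholder heuristic standing in for the chapter's model-plus-SHAP pipeline (its scores and "reasons" are invented); the validation logic in `check_response` is the part that carries over unchanged:

```python
import math

RISK_LEVELS = ("low", "medium", "high")

def predict_stub(features: dict) -> dict:
    """Placeholder scorer so the test harness is runnable on its own.

    Replace with the real pipeline: feature engineering -> preprocessor ->
    model.predict_proba -> SHAP top reasons.
    """
    inactivity = features["days_since_last_activity"] / 30.0
    engagement = features["sessions_last_30d"] / 30.0
    proba = max(0.0, min(1.0, 0.05 + 0.5 * inactivity - 0.3 * engagement))
    risk = "high" if proba > 0.20 else "medium" if proba > 0.10 else "low"
    reasons = [
        ("days_since_last_activity", 0.5 * inactivity),
        ("sessions_last_30d", -0.3 * engagement),
        ("support_tickets_last_90d", 0.01 * features["support_tickets_last_90d"]),
    ]
    return {"probability": proba, "risk_level": risk, "top_reasons": reasons}

def check_response(resp: dict) -> list[str]:
    """Part (b) invariants; returns a list of failure messages (empty = pass)."""
    errors = []
    if not 0.0 <= resp["probability"] <= 1.0:
        errors.append("probability out of range")
    if resp["risk_level"] not in RISK_LEVELS:
        errors.append("unknown risk level")
    if len(resp["top_reasons"]) != 3:
        errors.append("expected exactly 3 reasons")
    if any(not math.isfinite(v) for _, v in resp["top_reasons"]):
        errors.append("non-finite attribution value")
    return errors

case = {"days_since_last_activity": 30.0, "sessions_last_30d": 0,
        "support_tickets_last_90d": 3}
resp = predict_stub(case)
failures = check_response(resp)
print("PASS" if not failures else f"FAIL: {failures}")
```

For part (d), loop `check_response` over all five test cases and tally passes; keeping the checks in a function means the same summary code works once the stub is swapped for the real pipeline.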
Exercise 8: Capstone Planning (Conceptual)
You are starting a capstone project with the NASA Turbofan Engine Degradation dataset (C-MAPSS). The business question is: "Which engines are most likely to fail within the next 14 cycles, so that maintenance can be scheduled during planned downtime?"
a) List the nine capstone components from the chapter. For each, describe in 1--2 sentences how it would differ from the StreamFlow implementation. Focus on the differences, not the similarities.
b) The cost structure is fundamentally different from churn prediction. A missed failure (false negative) can cost $500,000 in unplanned downtime and equipment damage. A false alarm (false positive) costs $15,000 in unnecessary maintenance. What threshold would you recommend, and why? Show the break-even precision calculation.
c) Sensor data is time-series data. The StreamFlow model used tabular features. Describe three feature engineering approaches for converting time-series sensor readings into tabular features suitable for gradient boosting. (Hint: consider rolling statistics, trend features, and change-point features.)
d) The maintenance team has 20 years of experience and does not trust "black box" models. Describe your SHAP explanation strategy: what would you show the maintenance team, and how would you frame it?
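For part (b), the first-order break-even arithmetic follows directly from the two stated costs; this sketch ignores second-order effects (e.g., that a true alert still incurs the maintenance cost, just on a planned schedule):

```python
cost_fn = 500_000   # missed failure: unplanned downtime + equipment damage
cost_fp = 15_000    # false alarm: unnecessary scheduled maintenance

# An alert is worth acting on when precision * cost_fn > cost_fp, i.e. when
# the expected avoided downtime exceeds the cost of the maintenance visit.
break_even_precision = cost_fp / cost_fn
print(f"break-even precision: {break_even_precision:.1%}")  # 3.0%
```

With a break-even precision of only 3%, the asymmetry argues for a very low alert threshold: almost any recall gain is worth a flood of false alarms, subject to the maintenance team's capacity to act on them.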
Solutions to selected exercises are available in the Answers to Selected Exercises appendix.