Quiz: Chapter 35
Capstone --- End-to-End ML System
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2--4 sentences.
Question 1 (Multiple Choice)
A data scientist builds a churn model in a Jupyter notebook and sends the pickle file to the engineering team for deployment. The engineering team writes separate preprocessing code for the API. In production, the model's AUC drops from 0.88 (notebook) to 0.71 (production). What is the most likely cause?
- A) The model overfit the training data
- B) Data drift occurred between training and deployment
- C) Train-serve skew: the preprocessing in production differs from the preprocessing in the notebook
- D) The engineering team used a different version of scikit-learn
Answer: C) Train-serve skew: the preprocessing in production differs from the preprocessing in the notebook. When preprocessing code is duplicated between training and serving, any inconsistency --- different imputation values, different scaling parameters, different feature ordering --- produces inputs the model was not trained on. The result is degraded performance that looks like drift but is actually an engineering bug. The solution is to serialize the preprocessing pipeline alongside the model and use the same artifact in both contexts. This is the most common cause of "the model worked in the notebook but not in production."
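The fix described above can be made concrete. Below is a minimal sketch (illustrative features and data, not the chapter's actual StreamFlow schema) of bundling preprocessing and model into a single scikit-learn Pipeline, so the notebook and the API load the identical artifact:

```python
# Sketch: serialize preprocessing together with the model so training and
# serving apply identical transforms. Features/data are illustrative only.
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

X_train = np.array([[5.0, 2.0], [np.nan, 1.0], [3.0, 0.0], [8.0, 4.0]])
y_train = np.array([0, 1, 1, 0])

# Imputation values and scaling parameters are fitted once, inside the
# pipeline, instead of being re-implemented by the engineering team.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# One artifact for both notebook and API: no duplicated preprocessing code.
joblib.dump(pipeline, "churn_pipeline.joblib")
served = joblib.load("churn_pipeline.joblib")
probs = served.predict_proba(X_train)[:, 1]
```

Because the serving side calls the same fitted pipeline, there is no second preprocessing implementation to drift out of sync.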
Question 2 (Short Answer)
The StreamFlow churn model uses a 60-day prediction window. Explain why this creates a monitoring challenge, and describe one approach to partially mitigate the label delay.
Answer: A 60-day prediction window means the true label (did the subscriber churn?) is not available until 60 days after the prediction. During those 60 days, the model could be degrading and nobody would know from performance metrics alone. One mitigation approach is to monitor proxy signals that are available in real time: data drift (PSI on input features), prediction distribution shifts (are the predicted probabilities trending higher or lower?), and early behavioral signals (subscribers who stop logging in within 7 days of a high-risk prediction). These proxies do not replace ground-truth evaluation, but they provide early warnings that something has changed.
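One of those proxy signals, PSI, can be computed without any labels. Here is a minimal sketch using synthetic continuous data; the quantile bucketing and epsilon smoothing are implementation choices, not prescribed by the chapter:

```python
# Minimal PSI (Population Stability Index) sketch: compare a serving-time
# feature distribution against its training-time reference. Data is synthetic.
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-6):
    # Bin edges come from the reference distribution (quantiles), with the
    # outer edges widened to catch out-of-range serving values.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, eps, None)  # avoid log(0)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train_dist = rng.normal(12.0, 3.0, size=5000)  # feature at training time
live_dist = rng.normal(8.0, 3.0, size=5000)    # usage has dropped in production

print(psi(train_dist, train_dist))  # near 0: no shift against itself
print(psi(train_dist, live_dist))   # large: would trip a 0.25 alert
```

The key property for label-delay mitigation is that this check runs the day the serving data arrives, not 60 days later.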
Question 3 (Multiple Choice)
Which of the following is the strongest argument for making the fairness audit a blocking gate (model does not deploy if it fails) rather than an advisory check (model deploys with a warning)?
- A) A blocking gate prevents all forms of bias in the model
- B) A blocking gate ensures the model's AUC is above the minimum threshold
- C) A blocking gate forces the team to address fairness issues during development, when they are cheapest to fix, rather than after deployment
- D) A blocking gate is required by law for all ML models
Answer: C) A blocking gate forces the team to address fairness issues during development, when they are cheapest to fix, rather than after deployment. Advisory checks are routinely ignored under deadline pressure ("we will fix it in the next release"). A blocking gate makes fairness a non-negotiable requirement, just like a minimum AUC threshold. Option A is wrong because no audit prevents all bias --- it can only detect disparities on the metrics it measures. Option B confuses fairness with performance. Option D is incorrect; there is no universal legal requirement, though specific domains (lending, hiring) have regulatory requirements.
Question 4 (Multiple Choice)
A monitoring system detects that PSI for sessions_last_30d has exceeded 0.25. The model's AUC on the most recent labeled data (from 60 days ago) is still 0.86. What is the correct response?
- A) Ignore the PSI alert because AUC is still good
- B) Immediately retrain the model with the latest data
- C) Investigate the cause of the drift, check whether it affects model predictions, and prepare for a possible retrain
- D) Lower the PSI threshold to 0.30 so the alert does not fire
Answer: C) Investigate the cause of the drift, check whether it affects model predictions, and prepare for a possible retrain. PSI measures input distribution shift, not performance degradation. Performance might still be fine if the relationship between features and the target has not changed (data drift without concept drift). However, the AUC measurement is 60 days old and may not reflect current performance. The correct response is investigation: what caused the shift (product change? seasonal effect? data pipeline bug?), does it affect the features the model relies on most (check SHAP importance), and should we retrain proactively? Option A ignores the signal. Option B is premature. Option D is alarm-silencing, not problem-solving.
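The "does it affect the features the model relies on most" step can be sketched as a quick triage check. The chapter mentions SHAP; the sketch below substitutes a tree model's built-in impurity importances as a lighter stand-in, and all feature names and data are synthetic:

```python
# Triage sketch: after a PSI alert on one feature, check whether that feature
# is among the ones the model relies on most. Built-in importances are used
# here in place of SHAP purely to keep the example self-contained.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
features = ["sessions_last_30d", "support_tickets_last_90d", "tenure_months"]
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.2 * rng.normal(size=1000) > 0).astype(int)  # driven by col 0

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranked = sorted(zip(features, model.feature_importances_),
                key=lambda kv: kv[1], reverse=True)

drifted = "sessions_last_30d"
top_k = {name for name, _ in ranked[:2]}
# If the drifted feature is high-importance, the drift is far more likely to
# degrade predictions, and a proactive retrain deserves serious consideration.
print(f"{drifted} in top-2 importances: {drifted in top_k}")
```

Drift in a low-importance feature is usually worth understanding but not urgent; drift in the model's dominant feature changes the calculus.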
Question 5 (Short Answer)
Explain the difference between a pipeline that runs once and a system that runs continuously. Use the StreamFlow capstone as an example, and identify at least two components that must operate on an ongoing basis after initial deployment.
Answer: A pipeline runs once to produce an output (e.g., a trained model). A system runs continuously to serve predictions, monitor performance, and adapt over time. In the StreamFlow capstone, two components operate on an ongoing basis: (1) the monitoring dashboard, which computes PSI daily and performance metrics weekly to detect drift and degradation, and (2) the batch scoring job, which runs nightly to score all active subscribers and produce the weekly high-risk list for the customer success team. The API endpoint also runs continuously, serving real-time predictions. The system forms a loop: predictions drive interventions, interventions produce outcomes, outcomes update the ROI analysis, and the ROI analysis informs the business question.
Question 6 (Multiple Choice)
A data scientist presents the StreamFlow capstone to the CFO. The CFO asks: "What is the ROI?" The data scientist answers: "The model saves $47,000 per month." What is wrong with this response?
- A) Nothing; the CFO asked for a number and got one
- B) The response should include the AUC and precision metrics
- C) The response presents a point estimate without acknowledging the assumptions or providing a range
- D) The ROI should be expressed as a percentage, not a dollar amount
Answer: C) The response presents a point estimate without acknowledging the assumptions or providing a range. The $47,000 figure depends on assumptions about intervention success rate, subscriber lifetime value, and attribution of retention to the model. The CFO is experienced enough to know that assumptions can be wrong. A credible response would be: "The model saves between $28,000 and $65,000 per month depending on intervention effectiveness, and it remains profitable at intervention success rates as low as 30%." This demonstrates rigor and builds trust. A single point estimate invites the question "what if you are wrong?" and leaves the data scientist without an answer.
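The range-based answer above comes from a sensitivity analysis over the assumptions. The sketch below shows the shape of that calculation; every input number (flagged churners, save value, program cost, success rates) is invented for illustration and chosen only to land near the range quoted in the answer:

```python
# Illustrative ROI sensitivity sketch: compute monthly savings across a range
# of intervention success rates instead of a single point estimate.
# All numbers here are made-up assumptions, not figures from the chapter.
flagged_per_month = 400          # true churners reached by the CS team
value_per_save = 300.0           # retained subscriber value, dollars
program_cost = 20_000.0          # monthly cost of the intervention program

def monthly_roi(success_rate):
    return flagged_per_month * success_rate * value_per_save - program_cost

low, mid, high = monthly_roi(0.40), monthly_roi(0.55), monthly_roi(0.70)
print(f"ROI range: ${low:,.0f} to ${high:,.0f} (midpoint ${mid:,.0f})")

# Break-even success rate: the rate at which the program just pays for itself.
break_even = program_cost / (flagged_per_month * value_per_save)
print(f"break-even success rate: {break_even:.0%}")
```

Reporting the range plus the break-even point answers "what if you are wrong?" before the CFO asks it.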
Question 7 (Multiple Choice)
Which of the following is the best reason to document architectural decisions (the "Architectural Decisions Log" in the chapter)?
- A) To prove that the data scientist considered alternatives
- B) To satisfy compliance requirements
- C) So that the person who maintains the system six months from now understands not just what was built, but why each choice was made
- D) To justify the project's budget
Answer: C) So that the person who maintains the system six months from now understands not just what was built, but why each choice was made. Architectural decisions encode context: why LightGBM instead of XGBoost, why batch + real-time instead of batch only, why the fairness gate is blocking instead of advisory. Without this documentation, a future engineer might change a decision without understanding its consequences. The code shows what was built. The decision log shows why. This is particularly important for ML systems where many design choices (threshold, serving mode, retraining policy) are not visible in the code itself.
Question 8 (Short Answer)
The chapter identifies four things that did not work in the StreamFlow capstone. Choose one and explain why discovering this after building the system is more valuable than reading about it in a textbook.
Answer: Take the 60-day churn window problem: the chapter notes it is too long and a 30-day window would catch subscribers earlier. Reading this in a textbook is abstract knowledge. Discovering it after building the system means you experienced the consequences: you saw labeled data arriving too late for monitoring, you saw subscribers who were already gone by the time the model flagged them, and you felt the frustration of knowing the intervention was too late. That experiential knowledge changes your behavior on the next project. You will ask "what is the label delay?" and "is the prediction window matched to the intervention timeline?" before writing any code, because you remember what happens when you don't.
Question 9 (Multiple Choice)
A junior data scientist builds a capstone project (Track A) and deploys the model as a FastAPI endpoint. She does not include experiment tracking, a fairness audit, or monitoring. A senior data scientist reviews the project and says: "This is a good start, but it is not production-ready." Which missing component represents the highest risk?
- A) Experiment tracking (MLflow)
- B) Fairness audit
- C) Monitoring (drift detection and performance tracking)
- D) ROI analysis
Answer: C) Monitoring (drift detection and performance tracking). Without monitoring, the model will degrade silently. There is no mechanism to detect data drift, concept drift, or performance decay. The model could be producing harmful predictions for months before anyone notices. Experiment tracking is important for reproducibility but does not affect the running system. A fairness audit is critical before deployment, but it is a one-time gate. ROI analysis is important for stakeholders but does not affect model correctness. Monitoring is the component that protects the system after deployment, and its absence is the highest-risk gap.
Question 10 (Short Answer)
The chapter describes three capstone tracks: Minimal (Track A), Full (Track B), and Extended (Track C). A hiring manager is reviewing portfolios. Explain why Track C is more impressive than Track B, and identify the specific deliverable that most strongly signals senior-level thinking.
Answer: Track C adds two elements that distinguish it from Track B: an A/B test design for the intervention program and a written retrospective. The A/B test design demonstrates causal thinking --- understanding that the ROI analysis is based on correlational assumptions and that a randomized experiment is needed to establish causation. The written retrospective is the deliverable that most strongly signals senior-level thinking, because it requires self-assessment: "what worked, what didn't, and what I'd do differently." This demonstrates that the candidate can learn from their own mistakes, communicate honestly about limitations, and think beyond the immediate technical task. Hiring managers look for this because it predicts how the candidate will perform on real projects where things go wrong.
Question 11 (Multiple Choice)
The chapter states: "Automate the detection. Keep the decision manual." This refers to:
- A) Feature engineering: automate feature creation, manually select features
- B) Monitoring and retraining: automate drift alerts, manually decide whether to retrain
- C) Deployment: automate the CI/CD pipeline, manually approve each release
- D) Fairness: automate metric computation, manually set the acceptable threshold
Answer: B) Monitoring and retraining: automate drift alerts, manually decide whether to retrain. The chapter argues that automatic retraining is dangerous because a model that retrains on drifted data without human review can learn to reproduce whatever caused the drift --- including data pipeline bugs, promotional events, or data quality issues. Automated detection ensures drift is noticed promptly. Manual decision-making ensures a human diagnoses the cause and determines the appropriate response (retrain, re-engineer features, investigate a business change, or wait).
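The division of labor in this answer can be sketched in code: the automated path ends at an alert, and nothing in it can trigger retraining. Thresholds, field names, and the alert structure below are assumptions for illustration:

```python
# Sketch of "automate the detection, keep the decision manual": the monitor
# raises alerts automatically, but the only automated action is describing
# the problem for a human. No code path here starts a retrain.
from dataclasses import dataclass

PSI_THRESHOLD = 0.25  # alert threshold used in the chapter's examples

@dataclass
class DriftAlert:
    feature: str
    psi: float
    suggested_checks: tuple

def check_drift(psi_by_feature, threshold=PSI_THRESHOLD):
    """Automated step: detect drift and return alerts. Nothing more."""
    alerts = []
    for feature, value in psi_by_feature.items():
        if value > threshold:
            alerts.append(DriftAlert(
                feature=feature,
                psi=value,
                suggested_checks=(
                    "data pipeline bug?",
                    "product or seasonal change?",
                    "is this feature high-importance?",
                ),
            ))
    return alerts

# Manual step: a human reviews each alert and chooses retrain / fix / wait.
alerts = check_drift({"sessions_last_30d": 0.31, "tenure_months": 0.04})
for a in alerts:
    print(f"ALERT {a.feature}: PSI={a.psi:.2f} -> needs human review")
```

An auto-retraining version would replace the final loop with a training job, which is exactly the coupling the chapter warns against: the system would happily learn from a pipeline bug.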
Question 12 (Short Answer)
The retrospective section says: "Starting with the business question kept the project focused. We never built a feature that nobody would use." Explain what this means in practical terms, and give a concrete example of a feature that would have been built without this discipline.
Answer: Defining the business question with precision --- "predict churn within 60 days for the customer success team" --- constrains which features are relevant and which are noise. In practical terms, this means every feature must answer: "Does this help predict churn?" and "Can the CS team act on this?" Without this discipline, a data scientist might build features like total_hours_watched_all_time (interesting but adds little over sessions_last_30d for churn prediction), device_type_diversity (technically computable but not actionable by the CS team), or NLP_sentiment_from_support_tickets (sophisticated but overkill when support_tickets_last_90d already captures the signal). Business question discipline prevents the "because we can" trap that wastes development time on features that do not improve the system.
Return to the chapter for the full capstone walkthrough.