Quiz: Chapter 36

The Road to Advanced: Deep Learning, Causal Inference, MLOps, and Where to Go Next


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2--4 sentences.


Question 1 (Multiple Choice)

For which of the following problems is gradient boosting (XGBoost/LightGBM) most likely to outperform deep learning?

  • A) Classifying chest X-rays as normal vs. pneumonia
  • B) Predicting customer churn from 30 tabular features (usage, demographics, billing)
  • C) Generating product descriptions from product attributes
  • D) Detecting objects in security camera footage

Answer: B) Predicting customer churn from 30 tabular features. Research consistently shows that gradient boosting matches or outperforms deep learning on tabular data, while being faster to train, easier to interpret, and requiring less data. Deep learning excels on unstructured data --- images (A, D) and text generation (C) --- where spatial or sequential patterns need to be learned directly from raw data.


Question 2 (Multiple Choice)

What does backpropagation compute?

  • A) The optimal values of the network weights
  • B) The gradient of the loss function with respect to every weight in the network
  • C) The forward pass through the network
  • D) The learning rate schedule for training

Answer: B) The gradient of the loss function with respect to every weight in the network. Backpropagation uses the chain rule of calculus to compute how much each weight contributed to the prediction error. The optimizer (SGD, Adam) then uses these gradients to update the weights. Backpropagation computes the direction and magnitude of the update; it does not compute the optimal values directly.
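The chain-rule computation the answer describes can be worked by hand for a one-neuron "network" (this tiny model is an illustrative assumption); a finite-difference check confirms the backpropagated gradient matches the numerical one:

```python
# What backpropagation computes: dL/dw for every weight, via the chain rule.
import math

def forward(w, b, x):
    z = w * x + b               # linear step
    return 1 / (1 + math.exp(-z))  # sigmoid activation

def loss(w, b, x, y):
    return (forward(w, b, x) - y) ** 2  # squared error

def backprop(w, b, x, y):
    # Chain rule: dL/dw = dL/da * da/dz * dz/dw (and dz/db = 1 for the bias)
    a = forward(w, b, x)
    dL_da = 2 * (a - y)
    da_dz = a * (1 - a)         # derivative of the sigmoid
    return dL_da * da_dz * x, dL_da * da_dz

w, b, x, y = 0.5, -0.2, 1.3, 1.0
gw, gb = backprop(w, b, x, y)

# Sanity check against a numerical (central finite-difference) gradient
eps = 1e-6
gw_num = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(gw, gw_num)  # the two should agree closely
```

An optimizer such as SGD would then take a step like `w -= lr * gw`; the gradient itself is all that backpropagation provides.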


Question 3 (Short Answer)

Explain the fundamental problem of causal inference. Use the StreamFlow retention offer as an example.

Answer: The fundamental problem is that for any individual subscriber, you can only observe one outcome: what happened when they received the offer, or what would have happened without it --- never both simultaneously. The unobserved outcome is the counterfactual. For a subscriber who received the retention offer and stayed, you cannot know whether they would have stayed anyway. Causal inference methods estimate what the missing counterfactual outcomes would have been, using randomization, matching, or natural experiments.


Question 4 (Multiple Choice)

StreamFlow sends a retention offer to high-risk subscribers (model score > 0.20) and observes that their churn rate drops from 22% to 14%. A data analyst concludes the offer reduced churn by 8 percentage points. What is the primary flaw in this reasoning?

  • A) The sample size is too small
  • B) The analyst did not account for selection bias --- subscribers who received the offer were already different from those who did not
  • C) The analyst should have used precision instead of churn rate
  • D) The 8 percentage point reduction is not statistically significant

Answer: B) The analyst did not account for selection bias. Subscribers with scores above 0.20 are systematically different from those below 0.20 --- they have different usage patterns, tenure, and engagement. Some of the observed churn reduction may be due to these differences (e.g., regression to the mean) rather than the retention offer. A causal estimate requires a valid comparison group, such as a randomized experiment or a difference-in-differences design.


Question 5 (Multiple Choice)

In a difference-in-differences analysis, what assumption must hold for the estimate to be valid?

  • A) The treatment and control groups must have the same baseline outcome level
  • B) The treatment and control groups must have followed parallel trends in the absence of treatment
  • C) The treatment must be randomly assigned
  • D) The sample sizes must be equal

Answer: B) The treatment and control groups must have followed parallel trends in the absence of treatment. The parallel trends assumption states that without the intervention, both groups would have changed at the same rate over time. The groups do not need the same baseline level (A), the treatment does not need to be randomized (C) --- that is the whole point of DiD as an observational method --- and sample sizes need not be equal (D).
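The DiD arithmetic itself is simple once the parallel trends assumption is granted. A minimal sketch, with made-up churn rates for illustration:

```python
# Difference-in-differences: subtract the control group's change over time
# from the treated group's change. The rates below are illustrative only.
pre  = {"treated": 0.22, "control": 0.20}
post = {"treated": 0.14, "control": 0.17}

change_treated = post["treated"] - pre["treated"]   # treated group's change
change_control = post["control"] - pre["control"]   # control group's trend

# The treated group's change beyond what the trend alone would predict
did = change_treated - change_control
print(f"estimated treatment effect: {did:+.2f}")
```

Under parallel trends, `change_control` estimates what would have happened to the treated group without the intervention, so the difference of differences isolates the treatment effect.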


Question 6 (Short Answer)

What is a feature store, and what problem does it solve?

Answer: A feature store is a centralized system for defining, computing, storing, and serving features for ML models. It solves the training-serving skew problem: ensuring that the features computed during model training (from historical data in batch) are identical to the features computed during model serving (from live data in real time). Without a feature store, teams often have two separate codepaths for computing features, leading to subtle inconsistencies that degrade model performance in production.


Question 7 (Multiple Choice)

An organization has: automated data pipelines, experiment tracking with MLflow, a model deployed via FastAPI, and basic monitoring dashboards. Models are retrained manually when drift is detected. What MLOps maturity level is this?

  • A) Level 0 (Manual Process)
  • B) Level 1 (ML Pipeline Automation)
  • C) Level 2 (CI/CD for ML)
  • D) Level 3 (Full Automation with Governance)

Answer: B) Level 1 (ML Pipeline Automation). The organization has automated data pipelines, experiment tracking, and a deployed model with monitoring --- but retraining is still manual. Level 2 requires automated retraining triggered by drift detection, CI/CD for model code, automated testing (data validation, model performance gates), and canary deployment. Level 0 would have no automation at all.


Question 8 (Multiple Choice)

A data scientist wants to determine whether a new onboarding tutorial causes users to engage more with the product. They compare users who completed the tutorial (treatment) with users who skipped it (control) and find that tutorial completers have 40% higher engagement. Why is this not a valid causal estimate?

  • A) The treatment effect is too large to be believable
  • B) Users who choose to complete the tutorial are self-selected --- they are likely already more motivated and engaged
  • C) Engagement is not a valid outcome metric
  • D) The analysis should use deep learning instead of a simple comparison

Answer: B) Users who choose to complete the tutorial are self-selected. Motivated users are more likely to both complete the tutorial and engage more with the product. The observed difference conflates the effect of the tutorial with pre-existing differences in motivation. A valid causal estimate would require random assignment to tutorial vs. no tutorial, or an observational method that accounts for this confounding.


Question 9 (Short Answer)

Name two skills from this textbook that transfer directly to deep learning work, and explain why.

Answer: (1) Honest evaluation (Chapters 16--19) --- deep learning models are even more prone to overfitting than classical models due to their high parameter count, making proper train/validation/test splits, early stopping, and skepticism about metrics even more critical. (2) Feature engineering judgment (Chapters 6--9) --- while deep learning can learn features from raw data, decisions about data augmentation, input preprocessing, and transfer learning require the same intuition about "what information does the model need?" that was developed throughout the feature engineering chapters.


Question 10 (Multiple Choice)

Which of the following is the best advice for choosing between batch prediction and real-time prediction?

  • A) Always use real-time prediction because it provides the most current results
  • B) Use real-time prediction only when decisions require sub-second response times
  • C) Use batch prediction only when you lack the infrastructure for real-time
  • D) Real-time prediction is more accurate because it uses the latest data

Answer: B) Use real-time prediction only when decisions require sub-second response times. Batch prediction (pre-computing predictions on a schedule) is simpler, cheaper, more reliable, and sufficient for the vast majority of use cases (nightly churn scores, weekly demand forecasts, daily anomaly reports). Real-time prediction adds complexity (low-latency feature retrieval, API availability, scaling) that is only justified when the business requires immediate predictions (e.g., fraud detection at transaction time).


Question 11 (Short Answer)

A colleague says: "We should switch from XGBoost to a neural network for our tabular churn prediction model because neural networks are more powerful." How would you respond?

Answer: "More powerful" is not the right framing. On tabular data, gradient boosting models consistently match or outperform neural networks in benchmarks, while being faster to train, easier to interpret, and requiring less hyperparameter tuning. The relevant question is whether the data type or problem structure favors deep learning --- and for structured tabular data with well-defined features, it generally does not. Switch to deep learning when you have unstructured data (images, text, audio) or when you have evidence that a neural architecture outperforms on your specific dataset.


Question 12 (Multiple Choice)

What distinguishes causal inference from predictive modeling?

  • A) Causal inference uses more sophisticated algorithms
  • B) Causal inference answers "what would happen if we intervened?" while predictive modeling answers "what is likely to happen?"
  • C) Causal inference requires larger datasets
  • D) Causal inference does not use statistical models

Answer: B) Causal inference answers "what would happen if we intervened?" (a counterfactual question) while predictive modeling answers "what is likely to happen given the observed features?" (a conditional probability question). A churn model predicts who will churn. Causal inference estimates whether a specific intervention (e.g., a retention offer) changes the probability of churn. The two require different methods, different assumptions, and different data structures.


Question 13 (Short Answer)

You have completed this textbook. Name the single most important skill you have developed, and explain why it matters more than any specific algorithm.

Answer: Judgment. The ability to decide which problem to solve, which features to engineer, which model to try, which metric to optimize, which threshold to set, which limitations to acknowledge, and which results to communicate. Algorithms are tools. Judgment is knowing which tool to use, when to use it, and when the tool is not the right answer. Every algorithm in this book will be superseded by newer methods. The judgment to apply them correctly will remain valuable throughout a career.


This quiz covers Chapter 36: The Road to Advanced. Return to the chapter to review concepts.