Key Takeaways: Chapter 36
The Road to Advanced: Deep Learning, Causal Inference, MLOps, and Where to Go Next
-
Deep learning is essential for images, text, and audio --- and usually unnecessary for tabular data. Gradient boosting (XGBoost, LightGBM, CatBoost) still outperforms neural networks on structured, tabular data in the majority of benchmarks. Deep learning earns its complexity when the input is unstructured: pixel grids (images), token sequences (text), waveforms (audio), or video. Before reaching for a neural network, ask: "Is my data tabular?" If yes, start with gradient boosting.
-
Transfer learning is almost always the right starting point for deep learning. Training a neural network from scratch requires massive datasets (50,000+ labeled images, millions of text examples). Fine-tuning a pre-trained model (ResNet for images, BERT for text) requires far less data (hundreds to thousands of examples) and trains faster. The pre-trained model has already learned general features (edges, textures, syntax, semantics). You teach it the specifics of your domain.
-
Predictive models answer "what will happen?" Causal inference answers "what happens if we intervene?" A churn model predicts who will cancel. It does not tell you whether the retention offer caused anyone to stay. These are different questions requiring different methods. Most ML teams are strong on prediction and weak on causation --- which means they often cannot prove that their interventions work.
-
The fundamental problem of causal inference is a missing data problem. For every individual, you observe one outcome: what happened with treatment or what happened without it. The counterfactual --- the outcome you did not observe --- is missing by definition. Causal inference methods (randomized experiments, difference-in-differences, propensity score matching, regression discontinuity, instrumental variables) all estimate what the missing counterfactual would have been.
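A toy simulation makes the missing-data framing concrete (a sketch with a fabricated effect of +2.0): every unit has two potential outcomes, but only one is ever observed. Under random assignment, the difference in observed group means recovers the average treatment effect.

```python
# The fundamental problem as a simulation: both potential outcomes exist,
# but each unit reveals only one of them.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y0 = rng.normal(10, 2, n)         # outcome without treatment
y1 = y0 + 2.0                     # outcome with treatment (true ATE = 2.0)
t = rng.integers(0, 2, n)         # randomized assignment
y_obs = np.where(t == 1, y1, y0)  # the counterfactual stays unobserved

ate_hat = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(f"estimated ATE: {ate_hat:.2f}")  # close to the true 2.0
```

With observational data the assignment `t` would correlate with `y0`, and the same difference in means would be biased; that bias is what the observational methods below try to remove.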
-
Randomized experiments are the gold standard for causal inference, but they are not always possible. When you can randomize (A/B tests), do it. When you cannot (ethical constraints, operational constraints, retrospective analysis), use observational methods. The key is understanding the assumptions each method requires and honestly assessing whether those assumptions hold in your context.
-
Difference-in-differences requires parallel trends; propensity score matching requires no unobserved confounders; regression discontinuity requires a sharp treatment threshold. No causal method is assumption-free. The value of understanding multiple methods is that you can triangulate: if three methods with different assumptions produce similar estimates, the result is more credible than any single estimate alone.
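The difference-in-differences arithmetic is simple enough to show directly (the group means below are hypothetical): under parallel trends, the control group's change stands in for the treated group's missing counterfactual change.

```python
# Difference-in-differences on hypothetical pre/post group means.
treated_pre, treated_post = 50.0, 58.0
control_pre, control_post = 40.0, 45.0

# Treated change (8.0) minus control change (5.0) = estimated effect.
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"DiD estimate: {did:.1f}")  # 3.0
```

The subtraction only identifies a causal effect if, absent treatment, the treated group would have changed by the same 5.0 as the control group; that is the parallel-trends assumption doing all the work.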
-
MLOps maturity progresses from manual notebooks (Level 0) to fully automated pipelines with governance (Level 3). Most organizations are at Level 0 or 1. The goal is not to reach Level 3 --- it is to reach the level that matches your organization's needs. A company with two models in production does not need a feature store. A company with fifty models does.
-
The training-serving skew problem is the most common source of silent production failures in ML. When training features are computed differently from serving features --- even subtly (e.g., fillna(median) vs. COALESCE(value, 0)) --- model performance degrades in production without any visible error. Feature stores solve this by defining features once and materializing them to both training and serving environments.
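The skew can be reproduced in a few lines (a sketch with made-up values): the training pipeline imputes with the training median, while the serving path zero-fills in the style of COALESCE(value, 0), so the same missing value becomes two different features.

```python
# Training-serving skew: one missing value, two imputation conventions.
import math

import pandas as pd

train_values = pd.Series([10.0, 20.0, None, 40.0])
train_median = train_values.median()  # 20.0

incoming = float("nan")  # the same missing value arriving at serving time

train_feature = pd.Series([incoming]).fillna(train_median).iloc[0]  # training convention
serve_feature = 0.0 if math.isnan(incoming) else incoming           # serving zero-fill

# Same record, two different feature values -- and no error is raised anywhere.
print(train_feature, serve_feature)  # 20.0 vs. 0.0
```

Nothing crashes; the model simply sees inputs at serving time that it never saw in training, which is why the failure is silent.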
-
Start with batch prediction. Move to real-time only when the business requires it. Nightly churn scores, weekly demand forecasts, daily anomaly reports --- all batch. Batch prediction is simpler, cheaper, and more reliable. Real-time prediction (sub-second response) adds complexity that is only justified for use cases like fraud detection at transaction time or real-time personalization.
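A nightly churn-scoring job can be this small (a sketch: the tiny in-memory model, the customer snapshot, and the output path are all hypothetical stand-ins for a persisted model, a warehouse query, and a results table):

```python
# Minimal batch-scoring sketch: load data, score it, write results.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-ins for a persisted model and the day's customer snapshot.
train = pd.DataFrame({"tenure": [0.0, 1.0, 2.0, 3.0]})
model = LogisticRegression().fit(train, [0, 0, 1, 1])
customers = pd.DataFrame({"customer_id": [1, 2, 3], "tenure": [0.5, 1.5, 2.5]})

customers["churn_score"] = model.predict_proba(customers[["tenure"]])[:, 1]
# In production the scores would be written for downstream consumers, e.g.:
# customers.to_parquet("scores/latest.parquet")
print(customers)
```

No serving infrastructure, no latency budget, no request handling: the entire "deployment" is a scheduled script, which is why batch is the right default.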
-
Four paths branch from this foundation: NLP, computer vision, experimentation, and ML engineering. Each path has its own tools, community, and career trajectory. The right choice depends on what energizes you, not what seems most prestigious. All four paths value the same foundational skills: honest evaluation, reproducible pipelines, stakeholder communication, and ethical awareness.
-
Five skills from this textbook transfer to every specialization. Feature engineering judgment (knowing what information the model needs), honest evaluation (preventing yourself from fooling yourself), the deployment mindset (making models work outside notebooks), business communication (translating metrics into decisions), and ethical awareness (understanding the potential for harm). These are the skills that compound over a career.
-
The gap between knowing pandas and getting hired as a data scientist is not about algorithms --- it is about judgment. Judgment is knowing which problem to solve, which model to try, which metric to optimize, which threshold to set, which limitations to acknowledge, and which results to communicate. Algorithms are tools. Judgment is knowing when and how to use them. Every algorithm will eventually be superseded. Judgment endures.
If You Remember One Thing
A model is a bet. A bet that the patterns in your historical data will hold in the future, that your features capture the signal that matters, and that your evaluation honestly measures what you think it measures. This book has taught you to make better bets --- to engineer features that encode domain knowledge, to evaluate honestly, to deploy responsibly, and to monitor relentlessly. The field will evolve. The architectures will change. The tools will be replaced. The judgment to apply them wisely is what this book has given you, and it is what will define your career.
These takeaways summarize Chapter 36: The Road to Advanced. Return to the chapter for full context.