Chapter 2 Key Takeaways
The Machine Learning Workflow
- The ML workflow has eight stages, not three. The real lifecycle is: problem framing, success metric definition, data collection/validation, baseline establishment, feature/model iteration, offline evaluation, deployment/online evaluation, and monitoring/maintenance. Training a model is one step of eight.
- Problem framing is the most important and most neglected step. Before writing any code, define five things precisely: the target variable, the observation unit, the prediction timing, what information is available at prediction time, and what action will change based on the prediction. If you cannot answer all five, you are not ready to build a model.
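The five questions can be captured as a lightweight checklist that gates modeling work. A minimal sketch in Python — the class, field names, and example answers are illustrative, not from the chapter:

```python
from dataclasses import dataclass, fields

@dataclass
class ProblemFraming:
    """The five questions to answer before any modeling code is written."""
    target_variable: str        # e.g. "churned within 30 days"
    observation_unit: str       # e.g. "one row per subscriber per month"
    prediction_timing: str      # e.g. "first day of each billing cycle"
    available_information: str  # e.g. "activity up to, not including, that day"
    downstream_action: str      # e.g. "retention offer to the top-risk decile"

    def is_complete(self) -> bool:
        # Ready to model only when all five answers are non-empty.
        return all(str(getattr(self, f.name)).strip() for f in fields(self))

framing = ProblemFraming(
    target_variable="churned within 30 days",
    observation_unit="subscriber-month",
    prediction_timing="start of billing cycle",
    available_information="",  # not yet pinned down
    downstream_action="retention email to top decile",
)
print(framing.is_complete())  # False: one question is still unanswered
```

The point is not the code but the gate: an empty answer is a visible blocker rather than an assumption discovered after deployment.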
- Every model must beat a stupid baseline. Establish the floor with trivially simple approaches: majority class prediction, mean prediction, or a single-feature heuristic. If your model barely beats the baseline, the problem is your features, not your algorithm. The heuristic baseline (what the business already does without ML) is the bar that justifies the ML investment.
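A floor like this takes only a few lines. A sketch in pure Python, where the toy labels mirror the roughly 8% churn rate discussed later in the chapter:

```python
from collections import Counter
from statistics import mean

def majority_class_baseline(y_train):
    """Classification floor: always predict the most common training label."""
    return Counter(y_train).most_common(1)[0][0]

def mean_baseline(y_train):
    """Regression floor: always predict the training mean."""
    return mean(y_train)

# Toy labels: 8 churners out of 100 subscribers.
y_train = [1] * 8 + [0] * 92
pred = majority_class_baseline(y_train)
accuracy = sum(y == pred for y in y_train) / len(y_train)
print(pred, accuracy)  # 0 0.92 -- high accuracy, zero value
```

Any candidate model's metrics should be reported alongside these floors, not in isolation.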
- Data leakage is the most dangerous pitfall in ML. It occurs when training data contains information not available at prediction time. Warning signs: AUC-ROC above 0.95, single-feature dominance in importance scores, and large gaps between offline and production performance. For every feature, ask: "Would I have this value at the exact moment I need to make the prediction?"
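One cheap screen follows directly from the warning signs: score each feature alone and flag any that comes close to perfect separation by itself. A sketch — the rank-based AUC implementation and the feature names are illustrative:

```python
def auc_roc(y_true, scores):
    """Rank-based AUC: chance a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flag_suspicious_features(y_true, feature_columns, threshold=0.95):
    """Single features that nearly separate the classes on their own often leak."""
    flagged = []
    for name, col in feature_columns.items():
        a = auc_roc(y_true, col)
        if max(a, 1 - a) >= threshold:  # leakage can point in either direction
            flagged.append(name)
    return flagged

y = [1, 0, 1, 0, 1, 0]
features = {
    "days_active_last_30": [2, 9, 7, 3, 6, 5],       # plausibly predictive
    "cancellation_form_opened": [1, 0, 1, 0, 1, 0],  # recorded after the label event
}
print(flag_suspicious_features(y, features))  # ['cancellation_form_opened']
```

A flagged feature is not automatically leaky, but it earns the "would I have this value at prediction time?" question before anything else does.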
- Features matter more than algorithms. In the StreamFlow case, adding well-engineered features improved AUC-ROC from 0.71 to 0.81. Switching from logistic regression to gradient boosting only added another 0.03. Spend your time on domain-driven feature engineering, not algorithm shopping.
- Use three data partitions, not two. The training set fits the model. The validation set tunes hyperparameters and selects features (and gets "consumed" in the process). The test set provides one final honest evaluation. If you peek at the test set multiple times and adjust your model, it is no longer a test set. For temporal data, use time-based splits — the test set should always be the most recent data.
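The time-based split itself is simple to get right. A sketch of a time-ordered three-way split — the function, key names, and default fractions are illustrative:

```python
def time_based_split(rows, timestamp_key, val_frac=0.15, test_frac=0.15):
    """Sort by time, then cut: the test set is always the most recent slice."""
    ordered = sorted(rows, key=lambda r: r[timestamp_key])
    n = len(ordered)
    val_start = int(n * (1 - val_frac - test_frac))
    test_start = int(n * (1 - test_frac))
    return ordered[:val_start], ordered[val_start:test_start], ordered[test_start:]

rows = [{"month": m, "churned": m % 3 == 0} for m in range(1, 21)]  # 20 monthly rows
train, val, test = time_based_split(rows, "month", val_frac=0.2, test_frac=0.2)
print(len(train), len(val), len(test))                          # 12 4 4
print(max(r["month"] for r in train) < min(r["month"] for r in test))  # True
```

The final check in the example is the invariant worth asserting in real pipelines: no training row may be newer than any test row.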
- Offline evaluation is necessary but not sufficient. Offline metrics (AUC-ROC, precision@K, calibration) tell you how the model ranks and scores on historical data. Online evaluation (A/B testing) tells you whether the model actually improves business outcomes in the real world. Both are required.
- Accuracy is almost never the right metric for production ML. When the positive class is rare (8.2% churn at StreamFlow), a model that predicts the majority class achieves high accuracy while providing zero value. Use AUC-ROC, precision-recall metrics, or domain-specific metrics (like precision@K) that account for class imbalance and operational constraints.
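Precision@K is easy to compute directly, and it matches the operational question — of the K accounts the team can afford to contact, how many were real churners? A sketch with illustrative data:

```python
def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scoring observations."""
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(y_true[i] for i in top_k) / k

# 10 subscribers, 2 churners; the model ranks both churners near the top.
y_true = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
scores = [0.1, 0.9, 0.3, 0.2, 0.4, 0.1, 0.5, 0.2, 0.3, 0.1]
print(precision_at_k(y_true, scores, k=3))  # 2 of the top 3 are churners
```

Here K encodes the operational constraint (outreach capacity), which is exactly what plain accuracy ignores.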
- Deployment is the beginning, not the end. Models decay. Data distributions shift. Feature pipelines break. Product changes introduce new categories the model has never seen. Monitor prediction distributions, feature distributions, and actual outcomes. Establish a retraining cadence. Budget 20% of project time for monitoring and maintenance.
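Feature drift can be quantified with a simple statistic such as the population stability index. A minimal sketch — the binning scheme is simplified, and the 0.2 alert threshold is a common industry rule of thumb, not from the chapter:

```python
import math

def population_stability_index(expected, actual, n_bins=10):
    """Compare a live sample of a feature against its training-time distribution."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_frac(sample, b):
        in_bin = sum(
            1 for x in sample
            if lo + b * width <= x < lo + (b + 1) * width
            or (b == n_bins - 1 and x >= hi)  # fold the top edge into the last bin
        )
        return max(in_bin / len(sample), 1e-6)  # avoid log(0) for empty bins

    return sum(
        (bin_frac(actual, b) - bin_frac(expected, b))
        * math.log(bin_frac(actual, b) / bin_frac(expected, b))
        for b in range(n_bins)
    )

training = [i / 100 for i in range(100)]    # feature values at training time
live_ok = list(training)                    # no drift
live_shifted = [x + 0.5 for x in training]  # distribution moved right
print(population_stability_index(training, live_ok))            # 0.0
print(population_stability_index(training, live_shifted) > 0.2)  # True: alert
```

Running a check like this per feature on a schedule, with alerts above the threshold, is one concrete form of the monitoring budget described above.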
- The ML model code is approximately 5% of a production ML system. The other 95% is data pipelines, feature computation, model serving infrastructure, monitoring, alerting, retraining orchestration, and configuration management. Every model you deploy adds maintenance burden. Deploy models only when the expected business value clearly exceeds the ongoing cost.
- Define a metric hierarchy before you start. Primary metrics determine if the model is good. Secondary metrics provide additional diagnostic information. Guardrail metrics ensure the model does not cause unintended harm. Establish this hierarchy with stakeholders before building the model, not after — it prevents arguments about what "good" means.
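Writing the hierarchy down as a shared artifact is half the value. Even a plain config sketch works — the metric names below are hypothetical examples for a churn model, not from the chapter:

```python
metric_hierarchy = {
    "primary": ["AUC-ROC on the held-out test set"],         # decides ship / no-ship
    "secondary": ["calibration error", "precision@K"],       # diagnostics, not gates
    "guardrail": ["unsubscribe rate", "p95 serving latency"],  # must not get worse
}

for tier, metrics in metric_hierarchy.items():
    print(f"{tier}: {', '.join(metrics)}")
```

Checking a file like this into the project repository, signed off by stakeholders before modeling begins, turns "what does good mean?" from a late-stage argument into an early-stage review.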
- Budget your time realistically. A typical ML project: 20% problem framing and data understanding, 25% data collection and pipeline building, 20% feature engineering, 15% modeling and evaluation, 20% deployment and monitoring. The actual modeling — what most people picture as "data science" — is 15% of the work.
Quick Reference: The Eight-Stage ML Lifecycle
Stage 1: Problem Framing
-> What are we predicting? For whom? What action changes?
Stage 2: Success Metrics
-> Offline metrics (AUC-ROC, Precision@K)
-> Online metrics (A/B test outcomes, revenue impact)
-> Guardrail metrics (what must NOT get worse)
Stage 3: Data Collection and Validation
-> Extract from source systems
-> Validate: nulls, duplicates, distributions, temporal coverage
Stage 4: Baseline Establishment
-> Stupid baselines (majority class, predict mean)
-> Business heuristic baseline (what the team does today)
Stage 5: Feature Engineering and Model Iteration
-> Features first, algorithms second
-> Log every iteration
Stage 6: Offline Evaluation
-> Test set: evaluated ONCE
-> Check: leakage, calibration, subgroup performance
Stage 7: Deployment and Online Evaluation
-> Shadow deployment first, then canary, then full rollout
-> A/B test against the current business process
Stage 8: Monitoring and Maintenance
-> Track: prediction drift, feature drift, actual outcomes
-> Retrain on schedule or when drift exceeds thresholds
The One-Sentence Version
Most ML projects fail not because the algorithm was wrong, but because the problem was framed incorrectly, the data leaked, or nobody monitored the model after deployment.