Chapter 2 Further Reading
The Machine Learning Workflow
Papers
Sculley, D., et al. "Hidden Technical Debt in Machine Learning Systems." NeurIPS 2015. The definitive paper on why ML systems are expensive to maintain. Introduces the concept that ML model code is a small fraction of a production system, surrounded by configuration, data pipelines, monitoring, and infrastructure. Required reading for anyone who will deploy a model. The diagrams showing the "tiny box of ML code" inside a massive system have become iconic in the field.
Kaufman, S., Rosset, S., and Perlich, C. "Leakage in Data Mining: Formulation, Detection, and Avoidance." ACM Transactions on Knowledge Discovery from Data, 2012. The most rigorous treatment of data leakage in the literature. Provides a formal taxonomy of leakage types, methods for detection, and strategies for prevention. The examples are drawn from real Kaggle competitions and production systems where leakage led to grossly inflated performance estimates. Essential for building intuition about where leakage hides.
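One of the most common leakage patterns Kaufman et al. formalize is preprocessing that touches the test set before the split. A minimal sketch of the anti-pattern and its fix, using synthetic data (the dataset and model choice here are illustrative, not from the paper):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Leaky anti-pattern: fitting the scaler on ALL rows lets test-set
# statistics (mean, std) influence the training features.
#   X_all_scaled = StandardScaler().fit_transform(X)   # <-- leakage

# Correct: split first, then fit every preprocessing step only on the
# training fold; a Pipeline enforces this automatically during fit.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
```

With mild scaling the effect on this toy dataset is small; with target-dependent transforms (target encoding, feature selection on the full dataset) the inflation can be dramatic, which is exactly the class of failures the paper catalogs.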
Paleyes, A., Urma, R.-G., and Lawrence, N.D. "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys, 2022. A comprehensive survey of what goes wrong in real ML deployments. Covers data management, model training, deployment, and monitoring challenges drawn from published case studies across industries. Useful for understanding that the problems described in this chapter are universal, not specific to any one team or domain.
Books
Huyen, Chip. Designing Machine Learning Systems. O'Reilly, 2022. The best single book on the end-to-end ML workflow from a production perspective. Covers problem framing, data engineering, feature engineering, model development, deployment, and monitoring. Written by a practitioner who has built ML systems at scale. Chapters 2 (Introduction to ML Systems Design) and 9 (Continual Learning and Test in Production) are directly relevant to this chapter's content.
Lakshmanan, V., Robinson, S., and Munn, M. Machine Learning Design Patterns. O'Reilly, 2020. A pattern-language approach to ML system design. Covers 30 design patterns organized into categories: data representation, problem representation, model training, resilient serving, and reproducibility. The "Bridged Schema" pattern (handling schema changes in production data) and the "Stateless Serving Function" pattern (for deployment) are particularly relevant to the workflow stages discussed here.
Frameworks and Standards
CRISP-DM (Cross-Industry Standard Process for Data Mining) The original ML workflow framework, published in 1999. Defines six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Despite its age, the structure holds up remarkably well. The main weakness: it underemphasizes monitoring and maintenance. Every data scientist should know CRISP-DM, if only because it is referenced in virtually every enterprise data science job posting. Available at: https://www.datascience-pm.com/crisp-dm-2/
MLOps Maturity Model (Google Cloud) Google's framework for assessing how mature an organization's ML operations are, from Level 0 (manual, ad-hoc) to Level 2 (full CI/CD/CT automation). Useful for understanding where your team sits and what capabilities to build next. The progression from "data scientist runs everything in a notebook" to "automated pipeline retrains and deploys models with zero human intervention" maps directly to the deployment and monitoring stages in this chapter. Available at: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
Blog Posts and Industry Perspectives
Zinkevich, Martin. "Rules of Machine Learning: Best Practices for ML Engineering." Google, 2017. Forty-three rules for building production ML systems, written by a Google engineer. Rule 1: "Don't be afraid to launch a product without machine learning." Rule 4: "Keep the first model simple and get the infrastructure right." Rule 16: "Plan to launch and iterate." The rules are short, opinionated, and drawn from years of experience. Many reinforce this chapter's emphasis on baselines and iteration. Available at: https://developers.google.com/machine-learning/guides/rules-of-ml
Sambasivan, N., et al. "'Everyone Wants to Do the Model Work, Not the Data Work': Data Cascades in High-Stakes AI." CHI 2021. A qualitative study of 53 AI practitioners that documents how data quality problems compound across the ML workflow. The term "data cascades" describes how a data problem introduced early (e.g., during collection) causes failures that are only detected much later (e.g., during deployment). Directly relevant to Stage 3 (Data Collection and Validation) and the chapter's emphasis on validating data before training.
Shankar, S., et al. "Operationalizing Machine Learning: An Interview Study." arXiv preprint, 2022. Interviews with ML engineers at major tech companies about the practical challenges of deploying ML models. Key findings: most time is spent on data, not models; monitoring is the most underinvested area; and the transition from prototype to production is the most common failure point. Validates the time allocation estimates (15% modeling, 85% everything else) presented in this chapter.
Tools
scikit-learn User Guide: Cross-validation and Model Selection The official documentation covers StratifiedKFold, TimeSeriesSplit, cross_val_score, and other tools for proper evaluation. The "Visualizing Cross-Validation Behavior" section, with plots showing how different CV strategies split the data, is particularly educational. Available at: https://scikit-learn.org/stable/modules/cross_validation.html
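The two splitters mentioned above encode different assumptions about the data. A short sketch of how they differ in practice (synthetic data; the model is a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=120, random_state=0)

# StratifiedKFold: for i.i.d. data; preserves class proportions per fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)

# TimeSeriesSplit: for temporally ordered data; every training index
# precedes every validation index, so the model never sees the future.
tss = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()  # past-only training folds
```

Using shuffled K-fold on time-ordered data is itself a form of leakage, which is why the visualization section in the docs is worth studying closely.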
Great Expectations (open source data validation library) A Python library for validating, documenting, and profiling data. Allows you to define "expectations" (e.g., "this column should never have null values," "this column's values should be between 0 and 1") and run them as automated tests in your data pipeline. Directly relevant to the data validation code shown in Section 2.4. Available at: https://greatexpectations.io/
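The core idea behind Great Expectations can be sketched without the library itself: each "expectation" is a named, testable assertion about the data that runs before training. A hand-rolled version in plain pandas (column names and checks here are hypothetical examples, not drawn from Section 2.4):

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Run named data-quality checks; return a list of failure messages."""
    failures = []
    if df["probability"].isna().any():
        failures.append("probability contains nulls")
    if not df["probability"].between(0, 1).all():
        failures.append("probability outside [0, 1]")
    if not df["user_id"].is_unique:
        failures.append("user_id is not unique")
    return failures

df = pd.DataFrame({"user_id": [1, 2, 3], "probability": [0.1, 0.5, 0.9]})
failures = validate(df)  # empty list: all checks pass
```

Great Expectations packages this same pattern with a declarative API, automatic documentation, and data profiling, so the checks live in the pipeline rather than in ad-hoc notebook cells.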