Chapter 12 Further Reading: From Model to Production — MLOps
Foundational Papers and Books
1. Sculley, D., Holt, G., Golovin, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems (NeurIPS). The single most important paper on the reality of production ML. Sculley and colleagues at Google demonstrate that the actual ML model code represents a tiny fraction of a production ML system — the surrounding infrastructure for data collection, feature extraction, configuration, monitoring, and testing dwarfs the modeling code. The paper introduced the concept of "ML-specific technical debt" and remains essential reading for anyone involved in deploying ML. If you read only one item from this list, make it this one.
2. Kreuzberger, D., Kühl, N., & Hirschl, S. (2023). "Machine Learning Operations (MLOps): Overview, Definition, and Architecture." IEEE Access, 11, 31866-31879. The most comprehensive academic survey of MLOps as a discipline. Kreuzberger, Kühl, and Hirschl synthesize the scattered practitioner literature into a coherent framework — defining MLOps, classifying its components, and proposing a reference architecture. Useful for readers who want a structured, peer-reviewed overview rather than vendor-specific perspectives.
3. Gift, N., & Deza, A. (2021). Practical MLOps: Operationalizing Machine Learning Models. O'Reilly Media. A hands-on guide to implementing MLOps in practice, with coverage of cloud platforms (AWS, Azure, GCP), CI/CD for ML, model monitoring, and edge deployment. Gift and Deza balance conceptual discussion with practical implementation, including code examples and architecture patterns. Suitable for readers who want to move from understanding MLOps to implementing it.
4. Treveil, M., Omont, N., Stenac, C., et al. (2020). Introducing MLOps: How to Scale Machine Learning in the Enterprise. O'Reilly Media. An accessible introduction to MLOps aimed at business and technical leaders. Treveil and colleagues from Dataiku provide a framework for understanding MLOps maturity, team structures, and the organizational changes required to scale ML. Lighter on implementation detail than Gift and Deza, but stronger on strategy and organizational design. A good starting point for managers who need to understand MLOps without implementing it themselves.
The Deployment Gap and ML System Design
5. Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2022). "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys, 55(6), 1-29. A rigorous survey of deployment challenges across 20+ published case studies from companies including Google, Microsoft, Amazon, and Uber. Paleyes, Urma, and Lawrence categorize challenges into data management, model development, deployment, and monitoring — providing a research-backed complement to the practitioner-focused discussion in this chapter. Particularly strong on the organizational and process barriers to deployment.
6. Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). "Data Lifecycle Challenges in Production Machine Learning: A Survey." ACM SIGMOD Record, 47(2), 17-28. Focuses specifically on the data challenges that arise when ML systems move to production — data validation, feature engineering, training-serving skew, and data drift monitoring. Written by a team at Google, this paper provides detailed discussion of the data-centric challenges that are often underappreciated in ML operations. Directly relevant to Sections 12.5 (Feature Stores) and 12.7 (Monitoring).
7. Lwakatare, L. E., Raj, A., Bosch, J., Olsson, H. H., & Crnkovic, I. (2019). "A Taxonomy of Software Engineering Challenges for Machine Learning Systems: An Empirical Investigation." Proceedings of the 20th International Conference on Agile Software Development (XP 2019). An empirical study of the software engineering challenges specific to ML systems, based on interviews with practitioners at multiple companies. The taxonomy covers data management, model engineering, deployment, and process management. Useful for understanding how traditional software engineering practices must be adapted for ML.
Model Serving and Deployment Patterns
8. Olston, C., Fiedel, N., Gorovoy, K., et al. (2017). "TensorFlow-Serving: Flexible, High-Performance ML Serving." arXiv preprint arXiv:1712.06139. The paper describing TensorFlow Serving, Google's production model serving system. While framework-specific, the architectural principles — model versioning, canary deployment, request batching, and multi-model serving — apply broadly. Useful for readers who want to understand what production-grade model serving looks like under the hood.
9. Crankshaw, D., Wang, X., Zhou, G., et al. (2017). "Clipper: A Low-Latency Online Prediction Serving System." Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI '17). Describes Clipper, a model serving system designed for low-latency prediction. Key contributions include model containers (isolating models from the serving infrastructure), adaptive batching (optimizing throughput without sacrificing latency), and model selection policies (routing requests to different model versions). The architectural concepts are relevant to any real-time serving deployment.
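Canary deployment — a pattern both serving papers describe — can be sketched in a few lines: route a small, random fraction of traffic to a new model version and log which version served each request, so that online metrics can be compared before a full rollout. The router, parameter names, and stand-in models below are illustrative, not the API of either system.

```python
import random

def make_canary_router(stable_model, canary_model, canary_fraction=0.05, seed=None):
    """Route a random fraction of requests to the canary model.
    Returns a predict function and a log of which version served each request."""
    rng = random.Random(seed)
    log = []

    def predict(request):
        use_canary = rng.random() < canary_fraction
        model = canary_model if use_canary else stable_model
        version = "canary" if use_canary else "stable"
        log.append(version)  # compare per-version metrics offline
        return model(request), version

    return predict, log

# Stand-ins for real model servers
stable = lambda x: x * 2
canary = lambda x: x * 2 + 0.1

predict, log = make_canary_router(stable, canary, canary_fraction=0.1, seed=42)
results = [predict(i) for i in range(1000)]
print(f"canary share: {log.count('canary') / len(log):.2%}")  # ~10%
```

A production router would additionally pin users to a version for consistency and support instant rollback, but the core idea — probabilistic routing plus per-version logging — is the one above.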
Feature Stores
10. Bhatia, H., et al. (2019). "Building a Feature Store." Uber Engineering Blog. Uber's account of building one of the first large-scale feature stores as part of the Michelangelo platform (Case Study 1). Describes the motivation (feature reuse, training-serving consistency), the architecture (online and offline stores), and the governance challenges. A practical complement to the feature store discussion in Section 12.5.
11. Tecton.ai. (2021). "What is a Feature Store? The Definitive Guide." A comprehensive (if vendor-influenced) guide to feature stores — their purpose, architecture, design patterns, and organizational impact. Covers online vs. offline stores, feature engineering pipelines, governance, and the relationship between feature stores and other MLOps components. Useful for readers evaluating whether and when to implement a feature store.
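The online/offline split that both of these sources describe can be made concrete with a toy sketch: the offline store keeps an append-only history for building training sets, while the online store keeps only the latest values per entity for low-latency serving. Class and method names below are invented for illustration and do not correspond to any particular product.

```python
from datetime import datetime, timezone

class MinimalFeatureStore:
    """Toy feature store: an append-only offline log for training
    and an online key-value view for serving (names are illustrative)."""

    def __init__(self):
        self.offline_log = []   # full history, for point-in-time training sets
        self.online_view = {}   # latest values per entity, for serving

    def write(self, entity_id, features):
        row = {"entity_id": entity_id,
               "ts": datetime.now(timezone.utc),
               **features}
        self.offline_log.append(row)            # training path keeps history
        self.online_view[entity_id] = features  # serving path keeps latest

    def get_online_features(self, entity_id):
        # Serving reads the same values the pipeline wrote for training,
        # which is how a feature store prevents training-serving skew.
        return self.online_view.get(entity_id, {})

store = MinimalFeatureStore()
store.write("user_42", {"trips_7d": 3, "avg_fare": 18.5})
store.write("user_42", {"trips_7d": 4, "avg_fare": 17.9})
print(store.get_online_features("user_42"))  # latest values only
print(len(store.offline_log))                # full history retained
```

Real feature stores add point-in-time-correct joins, TTLs, and streaming ingestion, but the dual-store write path is the defining idea.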
CI/CD, Testing, and Monitoring
12. Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction." Proceedings of IEEE International Conference on Big Data. A testing rubric from Google that defines 28 specific tests for production ML systems across four categories: data tests, model tests, ML infrastructure tests, and monitoring tests. Each test is scored, providing a quantitative assessment of production readiness. Directly relevant to the testing pyramid described in Section 12.6. Excellent as a checklist when preparing a model for deployment.
13. Klaise, J., Van Looveren, A., Vacanti, G., & Coca, A. (2020). "Monitoring and Explainability of Models in Production." arXiv preprint arXiv:2007.06299. Discusses monitoring strategies for deployed models, with a focus on drift detection methods and explainability tools for production environments. Covers statistical tests for detecting data drift (PSI, KS test, chi-squared), concept drift detection methods, and how explainability techniques (SHAP, LIME) can be used in monitoring dashboards. Relevant to Section 12.7 and connects to Chapter 26 (Fairness, Explainability, and Transparency).
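As a taste of the drift statistics this literature covers, here is a minimal Population Stability Index (PSI) sketch in pure Python. The bin count and the commonly quoted 0.2 alert threshold are conventions to tune per feature, not part of the method itself.

```python
import math

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference sample (expected)
    and a production sample (actual), using bins from the reference."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int((v - lo) / width), n_bins - 1)  # clip above range
            counts[max(idx, 0)] += 1                      # clip below range
        # small floor so empty bins don't produce log(0)
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 100 for i in range(1000)]        # training-time distribution
shifted   = [i / 100 + 3.0 for i in range(1000)]  # drifted production data

print(f"PSI (no drift): {psi(reference, reference):.4f}")  # ~0
print(f"PSI (shifted):  {psi(reference, shifted):.4f}")    # well above 0.2
```

A monitoring job would compute this per feature on a rolling window of serving data and alert when the value crosses the chosen threshold.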
14. Sato, D., Wider, A., & Windheuser, C. (2019). "Continuous Delivery for Machine Learning." Martin Fowler's Blog (martinfowler.com). A thorough, accessible guide to applying continuous delivery principles to ML systems. Covers versioning (data, code, and model), testing strategies, deployment patterns, and monitoring — with clear architectural diagrams and practical examples. Written in Martin Fowler's trademark clarity, this article bridges the gap between software engineering best practices and ML-specific requirements. Ideal for readers with a software engineering background who are new to ML deployment.
MLOps Maturity and Organizational Design
15. Google Cloud. (2020). "MLOps: Continuous Delivery and Automation Pipelines in Machine Learning." Google Cloud Architecture Center. The source document for the MLOps maturity model (Level 0, Level 1, Level 2) used in Section 12.11. Provides detailed descriptions of each maturity level, including the technical capabilities, process changes, and organizational requirements at each level. While Google Cloud-oriented, the maturity model itself is platform-agnostic and widely referenced in the industry.
16. Shankar, V., Garcia, R., Hellerstein, J. M., & Parameswaran, A. G. (2022). "Operationalizing Machine Learning: An Interview Study." arXiv preprint arXiv:2209.09125. Based on interviews with 18 ML engineers across a range of organizations, this study documents the practical challenges of operationalizing ML — from data management and feature engineering through deployment and monitoring. The interview-based methodology captures insights that survey-based studies miss: the workarounds, the frustrations, and the hard-won lessons of practitioners. Particularly valuable for its honest assessment of the gap between MLOps ideals and organizational realities.
17. Sculley, D., Holt, G., Golovin, D., et al. (2014). "Machine Learning: The High-Interest Credit Card of Technical Debt." Proceedings of the SE4ML Workshop at NeurIPS. A precursor to the 2015 NeurIPS paper (item 1), this workshop paper introduces the concept of ML-specific technical debt using the metaphor of a high-interest credit card: easy to accumulate, expensive to pay down. Discusses glue code, pipeline jungles, dead experimental codepaths, and configuration debt. Short, accessible, and important for understanding why MLOps is necessary.
Case Study Background
18. Hermann, J., & Del Balso, M. (2017). "Meet Michelangelo: Uber's Machine Learning Platform." Uber Engineering Blog. The primary source for Case Study 1. Describes Michelangelo's architecture, capabilities, and the organizational motivations behind building a unified ML platform. Written by two of the platform's key architects, it provides insider detail on design decisions, trade-offs, and lessons learned.
19. Bernardi, L., Mavridis, T., Estevez, P., et al. (2019). "150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19). The primary source for Case Study 2. The paper's most provocative finding — that the correlation between offline model metrics and online business impact is weak — has significant implications for model evaluation and deployment practices. The six lessons span model development, deployment, monitoring, and organizational design.
Tools and Platforms
20. Zaharia, M., Chen, A., Davidson, A., et al. (2018). "Accelerating the Machine Learning Lifecycle with MLflow." IEEE Data Engineering Bulletin, 41(4), 39-45. The paper describing MLflow, the most widely adopted open-source MLOps tool. Covers MLflow's three core components: Tracking (experiment logging), Projects (reproducible runs), and Models (model packaging and deployment). MLflow is recommended in this chapter as a starting point for organizations beginning their MLOps journey.
21. Schelter, S., Lange, D., Schmidt, P., et al. (2018). "Automating Large-Scale Data Quality Verification." Proceedings of the VLDB Endowment, 11(12), 1781-1794. Describes Deequ, Amazon's data quality verification library for large-scale datasets. Covers automated data quality constraints, anomaly detection for data pipelines, and integration with ML workflows. Relevant to the data testing discussion in Section 12.6 and the data monitoring discussion in Section 12.7.
Industry Reports and Surveys
22. Algorithmia. (2021). "2021 Enterprise Trends in Machine Learning." Algorithmia Annual Survey Report. An industry survey of enterprise ML adoption, including data on deployment rates, time-to-deployment, staffing challenges, and tool adoption. Provides empirical support for the deployment gap statistics cited in Section 12.1. While the survey has inherent methodology limitations, it is one of the few quantitative sources on enterprise MLOps practices.
23. Gartner. (2022). "Gartner Survey Reveals 80% of Organizations Will Increase Investment in Applied AI by 2024." Gartner Research. The source of the widely cited statistic that most ML models fail to reach production. Gartner's analysis spans organizational readiness, talent availability, and infrastructure maturity — providing a macro view of the enterprise ML landscape that complements the practitioner-level discussion in this chapter.
Cost and Economics
24. Patterson, D., Gonzalez, J., Le, Q., et al. (2021). "Carbon Emissions and Large Neural Network Training." arXiv preprint arXiv:2104.10350. While focused on environmental costs rather than financial costs, this paper from Google provides a detailed analysis of the computational resources consumed by training large ML models. The methodology for estimating compute costs is applicable to any cost analysis of ML training and inference. Relevant to the cost management discussion in Section 12.13 and previews the sustainability considerations in Chapter 30 (Responsible AI in Practice).
25. Reddi, V. J., Cheng, C., Kanter, D., et al. (2020). "MLPerf Inference Benchmark." Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture. The MLPerf benchmark for inference performance — the standard for measuring and comparing model serving efficiency across hardware and software configurations. Understanding inference benchmarks is essential for cost optimization (Section 12.13), as inference efficiency directly determines per-prediction costs. Useful for readers evaluating hardware and software options for model serving.
This reading list spans the academic, practitioner, and vendor perspectives on MLOps. Priorities for different readers: Business leaders should start with items 1, 4, and 15. Technical practitioners should prioritize items 1, 12, 14, and 20. Researchers should begin with items 2, 5, and 16. Items 18 and 19 provide essential background for the chapter's case studies. All readers will benefit from item 1 (Sculley et al.) — it is the paper that launched the field of MLOps.