Further Reading: Chapter 35 --- Capstone: End-to-End ML System


End-to-End ML Systems

1. Designing Machine Learning Systems --- Chip Huyen (2022) The definitive book on production ML systems. Huyen covers the full lifecycle: project scoping, data engineering, feature engineering, model development, deployment, monitoring, and maintenance. Chapters 8 ("Data Distribution Shifts and Monitoring") and 9 ("Continual Learning and Test in Production") are directly relevant to this capstone. The book's central thesis --- that ML in production is a systems problem, not a modeling problem --- is exactly the lesson of this chapter. O'Reilly.

2. Reliable Machine Learning --- Cathy Chen, Niall Richard Murphy, et al. (2022) Written by Google SREs and ML engineers, this book applies site reliability engineering principles to ML systems. It includes chapters on testing ML systems, monitoring production ML, and incident management for ML failures. The "ML test score" framework it discusses (a production-readiness checklist, item 15 below) is directly applicable to the capstone checklist in this chapter. O'Reilly.

3. "Machine Learning: The High-Interest Credit Card of Technical Debt" --- Sculley et al. (2014) The Google paper that introduced the concept of ML-specific technical debt: hidden feedback loops, undeclared consumers, pipeline jungles, and configuration debt. Every production ML system accumulates this debt. The paper provides a taxonomy of debt types and mitigation strategies. Essential reading for anyone who has deployed a model and wondered why it got harder to maintain over time. Presented at the SE4ML workshop at NIPS 2014.

4. "Hidden Technical Debt in Machine Learning Systems" --- Sculley et al. (2015) A companion to the "credit card" paper with more detailed case studies from Google. The key insight: the model code in a production ML system is a small fraction of the total codebase. The surrounding infrastructure --- data pipelines, feature stores, monitoring, configuration management --- is where most of the complexity and most of the bugs live. Figure 1 ("Only a small fraction of real-world ML systems is composed of the ML code") should be on every data scientist's wall. NeurIPS 2015.


Architecture and Design Patterns

5. Building Machine Learning Powered Applications --- Emmanuel Ameisen (2020) A practical guide to going from idea to deployed ML product. Ameisen walks through a complete project (a writing assistant) from initial scoping to production deployment. Chapters 8--10 on testing, monitoring, and iterating are the most relevant. The book includes code examples and architecture diagrams for each stage. O'Reilly.

6. Machine Learning Design Patterns --- Lakshmanan, Robinson, and Munn (2020) A catalog of 30 design patterns for production ML: the Hashed Feature pattern, the Feature Store pattern, the Transform pattern, the Windowed Inference pattern, and others. Each pattern includes the problem it solves, the solution, and tradeoffs. The Transform pattern (ensuring training-serving consistency) and the Workflow Pipeline pattern (orchestrating multi-step ML workflows) are directly applicable to the capstone. O'Reilly.
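The core idea of the Transform pattern, bundling fitted preprocessing with the model so serving cannot drift from training, is what scikit-learn's Pipeline provides out of the box. A minimal sketch (the feature values here are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundling the transform with the model guarantees that serving
# applies exactly the scaling statistics that were fit at training time.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

X_train = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 210.0]])
y_train = np.array([0, 0, 1, 1])
pipeline.fit(X_train, y_train)

# At serving time, a raw feature vector passes through the same fitted
# transform before reaching the model -- no duplicated preprocessing code.
prediction = pipeline.predict(np.array([[3.5, 230.0]]))
```

Serializing the whole pipeline (rather than the bare model) is what closes the training-serving gap: there is no second implementation of the scaling logic to fall out of sync.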

7. "Introducing MLflow: An Open Source Machine Learning Platform" --- Matei Zaharia et al. (2018) The Databricks post announcing MLflow, expanded in the paper "Accelerating the Machine Learning Lifecycle with MLflow" (IEEE Data Engineering Bulletin, 2018). Covers the motivation for experiment tracking, the MLflow architecture, and early adoption results. While the tool has evolved significantly since 2018, this material provides useful context on why experiment tracking became a standard practice. Relevant background for Component 4 (experiment tracking) of the capstone.


Monitoring and Drift Detection

8. "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift" --- Rabanser, Günnemann, and Lipton (2019) A systematic comparison of drift detection methods: univariate tests (KS, chi-squared), multivariate tests (MMD, classifier two-sample test), and dimensionality reduction approaches. The paper finds that no single method dominates and recommends using multiple complementary methods. Directly relevant to Component 8 (monitoring). Published at NeurIPS 2019.
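The univariate tests the paper compares are straightforward to run with SciPy. A minimal sketch on synthetic data (the distributions, sample sizes, and binning here are illustrative, not taken from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature values seen at training time
production = rng.normal(loc=0.3, scale=1.0, size=5000)  # serving-time values with a small mean shift

# Kolmogorov-Smirnov two-sample test for a continuous feature
ks_stat, ks_p = stats.ks_2samp(reference, production)

# Chi-squared test on binned counts, as one would use for a categorical feature
ref_counts, _ = np.histogram(reference, bins=8, range=(-3, 3))
prod_counts, _ = np.histogram(production, bins=8, range=(-3, 3))
chi2, chi2_p, dof, expected = stats.chi2_contingency(np.vstack([ref_counts, prod_counts]))

print(f"KS: stat={ks_stat:.3f} p={ks_p:.1e}  chi2: stat={chi2:.1f} p={chi2_p:.1e}")
```

Both tests flag the shift here; on real traffic the paper's advice applies: run several complementary tests per feature, since each is blind to some kinds of shift.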

9. "Monitoring Machine Learning Models in Production" --- Christopher Samiullah (christophergs.com, 2020) A practitioner-oriented guide to monitoring ML systems. Covers the four levels of monitoring: infrastructure (is the service running?), data quality (are the inputs valid?), model performance (are the predictions correct?), and business impact (is the system creating value?). The four-level framework maps directly to the monitoring architecture in the capstone.
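The four levels can be wired into a single health-check summary. A hypothetical sketch (the field names, thresholds, and metric choices are invented for illustration, not from the guide):

```python
from dataclasses import dataclass

@dataclass
class MonitoringSnapshot:
    service_up: bool            # level 1: infrastructure
    invalid_input_rate: float   # level 2: data quality
    rolling_auc: float          # level 3: model performance
    conversion_lift: float      # level 4: business impact

    def alerts(self) -> list[str]:
        """Return one actionable alert per violated level."""
        out = []
        if not self.service_up:
            out.append("infrastructure: service down")
        if self.invalid_input_rate > 0.01:
            out.append("data: >1% invalid inputs")
        if self.rolling_auc < 0.70:
            out.append("model: AUC below threshold")
        if self.conversion_lift < 0.0:
            out.append("business: negative lift")
        return out

snap = MonitoringSnapshot(service_up=True, invalid_input_rate=0.03,
                          rolling_auc=0.68, conversion_lift=0.02)
print(snap.alerts())
```

Note how the healthy infrastructure and business levels stay quiet while the data and model levels fire: each level can fail independently, which is why all four need explicit checks.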

10. Practical Monitoring --- Mike Julian (2017) Not ML-specific, but the best introduction to monitoring philosophy and practice. Julian's principles --- monitor symptoms, not causes; alert only on things you can act on; every alert should trigger a defined response --- apply directly to ML monitoring. The chapter on "alert fatigue" is particularly relevant: when drift alerts fire constantly with no clear action plan, they end up ignored. O'Reilly.


Fairness in Production Systems

11. "Model Cards for Model Reporting" --- Mitchell et al. (2019) The standard framework for documenting ML models intended for production deployment. A model card includes: intended use, training data description, evaluation metrics across subgroups, ethical considerations, and known limitations. The capstone's fairness audit produces the data that populates a model card. Creating a model card should be a standard step in any production deployment. Published at FAT* 2019.

12. "Fairness and Machine Learning: Limitations and Opportunities" --- Barocas, Hardt, and Narayanan (ongoing) A freely available textbook on ML fairness. Chapters 2 ("Classification") and 4 ("Causality") are most relevant to the capstone. The book provides a rigorous treatment of fairness metrics, the impossibility results (when base rates differ across groups, no non-trivial classifier can satisfy calibration and error-rate balance simultaneously), and the relationship between fairness and causal inference. Available at fairmlbook.org.

13. "Aequitas: A Bias and Fairness Audit Toolkit" --- Saleiro et al. (2018) An open-source toolkit for auditing ML models for bias. Aequitas computes group-level fairness metrics (demographic parity, equal opportunity, predictive parity) and produces an audit report. The toolkit maps directly to Component 6 (fairness audit) of the capstone. Published at AAAI/ACM Conference on AI, Ethics, and Society.
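The group-level metrics such a toolkit computes reduce to a few conditional rates, which is worth seeing once by hand. A hand-rolled sketch (the labels, predictions, and group assignments are made up):

```python
import numpy as np

def group_fairness_report(y_true, y_pred, group):
    """Per-group selection rate (demographic parity) and TPR (equal opportunity)."""
    report = {}
    for g in np.unique(group):
        mask = group == g
        selection_rate = y_pred[mask].mean()          # P(y_hat = 1 | group = g)
        positives = mask & (y_true == 1)
        # TPR = P(y_hat = 1 | y = 1, group = g); undefined if the group has no positives
        tpr = y_pred[positives].mean() if positives.any() else float("nan")
        report[g] = {"selection_rate": selection_rate, "tpr": tpr}
    return report

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group  = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
print(group_fairness_report(y_true, y_pred, group))
```

Equal selection rates here coexist with unequal TPRs, a small illustration of why an audit must report multiple criteria rather than a single "fairness" number.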


Deployment and MLOps

14. Building Machine Learning Pipelines --- Hannes Hapke and Catherine Nelson (2020) A hands-on guide to building end-to-end ML pipelines with TFX, Apache Beam, and Kubeflow. Chapters 8--10 on model validation, pushing models to production, and pipeline automation are the most relevant. The authors distinguish between ML code (the model) and ML infrastructure (everything else) --- the same distinction this capstone emphasizes. O'Reilly.

15. "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" --- Breck et al. (2017) A checklist of 28 tests for production ML systems, organized into four categories: tests for data, tests for model development, tests for ML infrastructure, and tests for monitoring. Each test scores 0 points if not performed, 0.5 if performed manually, and 1 if automated; the final score is the minimum of the four category totals, and higher scores indicate greater production readiness. The rubric is directly applicable to the capstone and can be used as a self-assessment tool. Published in IEEE Big Data 2017.
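Turning the rubric into a self-assessment takes only a few lines. This sketch assumes the paper's scoring rule (0.5 for a test run manually, 1.0 for an automated test, final score = minimum over the four categories); the individual test results are invented:

```python
# Hypothetical per-test results for a capstone system: 7 tests per category,
# each scored 0.0 (not done), 0.5 (run manually), or 1.0 (automated).
scores = {
    "data":           [1.0, 0.5, 1.0, 0.0, 1.0, 0.5, 1.0],
    "model":          [0.5, 0.5, 1.0, 1.0, 0.0, 0.5, 0.5],
    "infrastructure": [1.0, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0],
    "monitoring":     [0.5, 0.0, 0.5, 1.0, 0.5, 0.5, 0.0],
}

section_totals = {name: sum(tests) for name, tests in scores.items()}

# Taking the minimum means the weakest category caps the overall score:
# strong data tests cannot compensate for neglected monitoring.
ml_test_score = min(section_totals.values())
print(section_totals, "->", ml_test_score)
```

Here monitoring is the bottleneck, which matches a common capstone failure mode: teams invest in data and infrastructure tests but leave monitoring unautomated.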


Portfolio Building and Career

16. Build a Career in Data Science --- Emily Robinson and Jacqueline Nolis (2020) The most practical guide to data science careers. Chapter 7 ("Building a Portfolio") is directly relevant to the capstone. Robinson and Nolis argue that a portfolio project should demonstrate business thinking (why did you choose this problem?), technical depth (how did you solve it?), and communication (can you explain it?). The three capstone tracks in this chapter are designed with these criteria in mind. Manning.

17. "How to Build a Data Science Portfolio That Employers Will Love" --- Will Stanton (2019) A blog post that distills portfolio advice into actionable steps. Key recommendations: start with a question, not a dataset; include an end-to-end project; show your reasoning (not just your code); and write clearly. The post includes examples of strong and weak portfolio projects. Available at Will Stanton's blog.

18. "What We Look for in a Data Science Resume and Portfolio" --- Rachel Thomas, fast.ai (2018) Thomas (co-founder of fast.ai) describes what hiring managers actually look for: evidence of independent thinking, ability to communicate to non-technical audiences, and a project that goes beyond "I ran this notebook." The capstone's retrospective section directly addresses her criteria: "We want to see that you can look at your own work critically and identify improvements."


The Messy Reality

19. "What I Wish I Had Known Before Starting a Data Science Project" --- Monica Rogati (2017) Rogati (former VP of Data at Jawbone) describes the "AI hierarchy of needs": data collection, data flow, data exploration, aggregation/labeling, learning/optimization. Most failed ML projects fail because they tried to climb the pyramid without a solid base. The capstone's emphasis on data extraction (Component 2) and pipeline reproducibility (Component 3) reflects this hierarchy. Available on Quora/Medium.

20. "Engineers Shouldn't Write ETL: A Guide to Building a High-Functioning Data Team" --- Jeff Magnusson, Stitch Fix (2016) Magnusson argues that the boundary between data engineering and data science should be clearly defined, and that data scientists who spend most of their time on ETL are misallocated. The capstone's architecture separates data extraction (Component 2, which should be managed by data engineering) from feature engineering (Component 3, which is the data scientist's domain). Understanding this boundary is critical for working effectively in cross-functional teams.


How to Use This List

If you are building your capstone and want a design reference, start with Huyen (item 1). It is the most comprehensive treatment of production ML systems available.

If you are writing your retrospective and want to understand common failure modes, read Sculley et al. (items 3--4). They provide the vocabulary for describing ML-specific technical debt.

If you are preparing your stakeholder presentation, review Breck et al. (item 15). The ML test score rubric gives you a framework for arguing that your system is production-ready.

If you are building a portfolio and want to understand what hiring managers look for, read Robinson and Nolis (item 16) and Thomas (item 18). Both provide concrete criteria for evaluating portfolio projects.

If you are concerned about fairness in your deployed system, read Barocas et al. (item 12) for the theory and Saleiro et al. (item 13) for the tooling.

If you want to understand why your data pipeline is the most error-prone component of the system, read Magnusson (item 20) and Hapke and Nelson (item 14).


This reading list supports Chapter 35: Capstone --- End-to-End ML System. Return to the chapter to review concepts before diving in.