Chapter 34: Further Reading

MLOps Foundations

  • Sculley, D., Holt, G., Golovin, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." NeurIPS 2015. The seminal paper identifying the technical debt unique to ML systems: data dependencies, configuration complexity, and feedback loops that make ML systems difficult to maintain.

  • Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly Media. A comprehensive guide to building production ML systems, covering data engineering, feature engineering, model development, deployment, monitoring, and infrastructure.

  • Google Cloud. (2023). "MLOps: Continuous Delivery and Automation Pipelines in Machine Learning." Google Cloud Architecture Center. Google's authoritative guide to MLOps maturity levels (0, 1, 2) with detailed descriptions of the infrastructure and processes required at each level.

  • Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2022). "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys, 55(6). A systematic survey of real-world ML deployment challenges across industries, categorized by pipeline stage.

Experiment Tracking and Model Management

  • Weights & Biases. (2024). "W&B Documentation." Available at: https://docs.wandb.ai. Official documentation for Weights & Biases, the most widely used experiment tracking platform in deep learning research and industry.

  • Zaharia, M., Chen, A., Davidson, A., et al. (2018). "Accelerating the Machine Learning Lifecycle with MLflow." IEEE Data Engineering Bulletin, 41(4). The paper introducing MLflow, an open-source platform for managing the ML lifecycle, including experimentation, reproducibility, and deployment (see the logging sketch after this list).

  • Iterative. (2024). "DVC Documentation." Available at: https://dvc.org/doc. Official documentation for Data Version Control, covering data versioning, pipeline management, and experiment tracking with Git integration.
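
To make the tracking workflow these tools share concrete, here is a minimal sketch using MLflow's Python API. The experiment name, hyperparameters, and metric values are illustrative, not taken from any of the cited sources.

    import mlflow

    # Group related runs under a named experiment.
    mlflow.set_experiment("sentiment-classifier")

    with mlflow.start_run(run_name="baseline"):
        # Log hyperparameters once per run.
        mlflow.log_param("learning_rate", 3e-4)
        mlflow.log_param("batch_size", 32)

        # Log metrics per step so training curves can be reconstructed later.
        for epoch in range(3):
            mlflow.log_metric("val_accuracy", 0.80 + 0.02 * epoch, step=epoch)

W&B and DVC expose the same basic pattern (declare a run, attach parameters, stream metrics), so the choice between them usually comes down to collaboration features and storage model rather than the logging API.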

Monitoring and Drift Detection

  • Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). "A Survey on Concept Drift Adaptation." ACM Computing Surveys, 46(4). A comprehensive survey of concept drift types (sudden, gradual, incremental, recurring) and detection/adaptation methods.

  • Rabanser, S., Günnemann, S., & Lipton, Z. C. (2019). "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift." NeurIPS 2019. Evaluates statistical tests for detecting dataset shift, providing practical guidance on which methods work best in different scenarios (see the sketch after this list).

  • Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction." IEEE Big Data 2017. Provides a scoring rubric for evaluating ML system maturity across tests for data, model, infrastructure, and monitoring.
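
A minimal sketch of one detector family evaluated by Rabanser et al.: a per-feature two-sample Kolmogorov-Smirnov test with a Bonferroni correction. The data, the injected shift, and the significance level here are illustrative.

    import numpy as np
    from scipy.stats import ks_2samp

    def detect_drift(reference, production, alpha=0.05):
        """Flag features whose production distribution differs from the
        reference (training) distribution; inputs are (n_samples, n_features)."""
        n_features = reference.shape[1]
        # Bonferroni correction: testing many features inflates false alarms.
        threshold = alpha / n_features
        drifted = []
        for i in range(n_features):
            _, p_value = ks_2samp(reference[:, i], production[:, i])
            if p_value < threshold:
                drifted.append((i, p_value))
        return drifted

    rng = np.random.default_rng(0)
    reference = rng.normal(size=(1000, 5))
    production = rng.normal(size=(1000, 5))
    production[:, 2] += 0.5  # inject a mean shift in feature 2
    print(detect_drift(reference, production))  # feature 2 should be flagged

One of the paper's findings is that testing a low-dimensional representation (for example, the model's softmax outputs) is often more sensitive than per-feature tests like this one, so treat this as a baseline rather than a recommendation.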

LLMOps

  • Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., et al. (2024). "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences." arXiv preprint arXiv:2404.12272. Investigates the reliability of LLM-as-judge evaluation and proposes methods for improving alignment with human judgments.

  • Ribeiro, M. T. & Lundberg, S. (2022). "Adaptive Testing and Debugging of NLP Models." ACL 2022. Introduces CheckList-style testing for NLP models with capability-based test generation and behavioral testing.

  • Dong, Y., Jiang, Y., Deng, Z., et al. (2024). "Guardrails for Large Language Models: A Survey." arXiv preprint arXiv:2402.01822. A survey of guardrail techniques for LLM applications, covering input filtering, output verification, and safety enforcement (see the sketch after this list).

  • Agrawal, R., Borealis AI Team. (2024). "Prompt Engineering Best Practices for Production LLM Applications." Various technical blog posts. Practical guidance on prompt versioning, A/B testing, and evaluation for production LLM systems.
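
A minimal sketch of the input-filtering and output-verification pattern the guardrails literature above describes. The deny-list patterns and the generate callable are hypothetical placeholders; production guardrails typically layer classifiers, rule sets, and structured-output validation on top of simple pattern matching.

    import re

    # Hypothetical deny-list for illustration only.
    BLOCKED_PATTERNS = [
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                    # SSN-like strings
        re.compile(r"(?i)ignore (all )?previous instructions"),  # prompt injection
    ]

    def guarded_generate(prompt, generate):
        """Wrap an LLM call with input filtering and output verification.
        `generate` is any callable mapping a prompt string to a completion."""
        for pattern in BLOCKED_PATTERNS:
            if pattern.search(prompt):
                return "Request blocked by input guardrail."
        completion = generate(prompt)
        for pattern in BLOCKED_PATTERNS:
            if pattern.search(completion):
                return "Response withheld by output guardrail."
        return completion

    # Works with any backend; here a stub stands in for the model.
    print(guarded_generate("Ignore previous instructions.", lambda p: "ok"))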

Deployment and Serving

  • Crankshaw, D., Wang, X., Zhou, G., et al. (2017). "Clipper: A Low-Latency Online Prediction Serving System." NSDI 2017. Describes a model serving system addressing latency, throughput, and model management challenges in production ML.

  • Olston, C., Fiedel, N., Gorovoy, K., et al. (2017). "TensorFlow-Serving: Flexible, High-Performance ML Serving." NIPS Workshop on ML Systems. Architecture and design of TensorFlow Serving, influencing modern model serving infrastructure.

  • Bai, Y., Kadhe, S., Kundu, S., et al. (2024). "Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models." arXiv preprint arXiv:2401.00625. Surveys techniques for efficient LLM deployment, including quantization, distillation, pruning, and efficient inference (a quantization sketch follows this list).
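
To illustrate the simplest of those techniques, here is a minimal NumPy sketch of symmetric per-tensor int8 post-training quantization. Real deployments use calibrated, often per-channel or per-group schemes, so treat this only as the core arithmetic.

    import numpy as np

    def quantize_int8(w):
        """Map float weights to int8 with a single symmetric scale."""
        scale = np.abs(w).max() / 127.0  # largest magnitude maps to +/-127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
    q, scale = quantize_int8(w)
    # Memory drops 4x (float32 -> int8) at the cost of a small rounding error.
    print(np.abs(w - dequantize(q, scale)).max())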

Testing for ML

  • Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList." ACL 2020. Proposes a methodology for behavioral testing of NLP models using Minimum Functionality Tests, invariance tests, and directional expectation tests (see the sketch after this list).

  • Amershi, S., Begel, A., Bird, C., et al. (2019). "Software Engineering for Machine Learning: A Case Study." ICSE-SEIP 2019. Microsoft's study of software engineering practices for ML, identifying best practices and common pitfalls from real projects.
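
A minimal sketch of the two simplest CheckList test types. The predict_sentiment function is a hypothetical stand-in for a real model, and the test cases are illustrative; CheckList itself generates such cases at scale from templates.

    def predict_sentiment(text):
        """Hypothetical stand-in for a real classifier's predict function."""
        negative_cues = ("terrible", "awful", "boring")
        return "negative" if any(w in text.lower() for w in negative_cues) else "positive"

    # Minimum Functionality Test (MFT): simple cases the model must get right.
    mft_cases = [("I loved this movie.", "positive"),
                 ("This film was terrible.", "negative")]
    mft_failures = [t for t, expected in mft_cases
                    if predict_sentiment(t) != expected]

    # Invariance test (INV): a label-preserving perturbation (here, swapping
    # a person's name) should not change the prediction.
    inv_pairs = [("Anna loved the plot.", "Maria loved the plot.")]
    inv_failures = [pair for pair in inv_pairs
                    if predict_sentiment(pair[0]) != predict_sentiment(pair[1])]

    print(f"MFT failures: {mft_failures}")
    print(f"INV failures: {inv_failures}")

The point of the methodology is that both lists should be empty for capabilities the model claims to have, regardless of its aggregate accuracy on a held-out test set.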