Further Reading: Chapter 29
Software Engineering for Data Scientists
Project Structure and Workflow
1. Cookiecutter Data Science --- drivendata.github.io/cookiecutter-data-science
The canonical project template for data science in Python. The website explains the rationale behind every directory in the layout: why raw data is separate from processed data, why notebooks are separate from source code, why a Makefile is the right abstraction for pipeline orchestration. The template itself is a cookiecutter generator, but the layout can be adopted manually without the tool. Version 2 (2024) updated the structure for modern tooling (pyproject.toml, DVC integration, containerization).
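The layout can be adopted manually without the cookiecutter tool, as the entry notes. A minimal sketch of scaffolding a template-style layout by hand with `pathlib` (the directory names follow the template's documented defaults; adapt to taste):

```python
from pathlib import Path

# Directories in the Cookiecutter-Data-Science-style layout.
LAYOUT = [
    "data/raw",         # original, immutable data dumps
    "data/processed",   # final datasets for modeling
    "notebooks",        # exploration; numbered for ordering
    "src",              # importable, tested source code
    "models",           # trained model artifacts
    "reports/figures",  # generated graphics for reporting
]

def scaffold(root: str = "my_project") -> Path:
    """Create the project skeleton under ``root`` and return its path."""
    base = Path(root)
    for d in LAYOUT:
        (base / d).mkdir(parents=True, exist_ok=True)
    (base / "README.md").touch()
    return base
```

This reproduces only the directory skeleton; the generator additionally fills in pyproject.toml, a Makefile, and boilerplate source files.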
2. "Cookiecutter Data Science: Best Practices for Data Science Project Structures" --- DrivenData Blog
A companion blog post that walks through the design decisions behind the template. The key insight: data science project structures should optimize for reproducibility and collaboration, not for the convenience of the original author. The post includes before-and-after examples of projects that adopted the structure, with concrete metrics on onboarding time and bug frequency.
3. Data Science at the Command Line --- Jeroen Janssens (2nd edition, 2021, O'Reilly)
Covers the Unix philosophy applied to data science workflows: composable tools, standard input/output, Makefiles for pipeline orchestration, and shell scripting for automation. Chapters 6-8 on project organization, workflow automation, and reproducibility complement this chapter's material. The book demonstrates that many data pipeline tasks (downloading, cleaning, transforming, validating) can be done with command-line tools rather than Python scripts, which is often simpler and faster.
Testing for Data Science
4. Python Testing with pytest --- Brian Okken (2nd edition, 2022, Pragmatic Programmers)
The definitive guide to pytest. Covers fixtures (including fixture scope, parametrization, and factories), markers, plugins, configuration, and CI integration. Chapters 3-5 on fixtures, parametrize, and built-in markers are directly relevant to testing data science code. The appendix on migrating from unittest to pytest is useful for teams inheriting legacy test suites.
5. "Effective Testing for Machine Learning Projects" --- Jeremy Jordan (jeremyjordan.me, 2020)
A blog post that addresses the specific challenge of testing ML code, where outputs are stochastic and "correctness" is probabilistic. Covers strategies for testing data pipelines (schema tests, distribution tests, invariant tests), feature engineering (known-input tests, cross-feature consistency tests), and model training (smoke tests, regression tests, performance bound tests). The distinction between "the code is correct" and "the model is good" is the central insight.
6. "Testing Data Pipelines" --- Eugene Yan (eugeneyan.com, 2021)
A practical guide to testing strategies specific to data engineering and ML pipelines. Covers table-level tests (row counts, null rates, uniqueness), column-level tests (range checks, distribution checks, referential integrity), and pipeline-level tests (schema stability between stages, end-to-end smoke tests). Includes examples using Great Expectations and custom pytest fixtures.
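The table- and column-level checks these guides describe can be written as ordinary pytest tests over a pandas DataFrame. A minimal illustration (the table, column names, and bounds are hypothetical):

```python
import pandas as pd
import pytest

@pytest.fixture
def orders() -> pd.DataFrame:
    # Stand-in for a pipeline output; in practice, load a small
    # representative sample or a staging table.
    return pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [9.99, 42.50, 3.25],
    })

def test_table_level(orders):
    # Table-level checks: row count and key uniqueness.
    assert len(orders) > 0
    assert orders["order_id"].is_unique

def test_column_level(orders):
    # Column-level checks: null rate and value range.
    assert orders["amount"].notna().all()
    assert (orders["amount"] > 0).all()
```

Great Expectations packages the same kinds of assertions as declarative, reusable suites; plain pytest versions like these are often enough to start.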
Code Quality Tools
7. Black Documentation --- black.readthedocs.io
The official documentation for Black, the "uncompromising" Python formatter. The FAQ explains why Black exposes so few configuration options: fewer choices mean fewer arguments, which lets teams focus on code logic instead of style debates. The "How Black wraps lines" section is useful for understanding why Black sometimes makes surprising formatting choices (and why those choices are usually correct).
8. Ruff Documentation --- docs.astral.sh/ruff
Ruff is a Rust-based Python linter that replaces flake8, isort, pyflakes, and dozens of other tools. The documentation covers all 700+ rules, organized by category (pyflakes, pycodestyle, isort, bugbear, etc.). The "Configuring Ruff" page shows how to set up pyproject.toml for a data science project, including which rules to enable, which to ignore, and how to set per-file overrides for notebooks.
9. Mypy Documentation --- mypy.readthedocs.io
The official documentation for Python's standard static type checker. The "Getting Started" section covers gradual typing: how to add type hints to an existing codebase incrementally without needing 100% coverage on day one. The "Common Issues" section addresses challenges specific to data science code, including typing pandas DataFrames, handling optional return types, and dealing with third-party libraries that lack type stubs.
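Gradual typing in practice means annotating new or frequently touched functions first while the rest stays dynamic. A small sketch of the optional-return pattern the docs discuss (function names are illustrative):

```python
from typing import Optional

def parse_threshold(raw: str) -> Optional[float]:
    """Return the parsed threshold, or None for malformed input.

    Annotating the Optional return forces callers (under mypy)
    to handle the None case explicitly.
    """
    try:
        return float(raw)
    except ValueError:
        return None

def clip(value: float, raw_threshold: str) -> float:
    threshold = parse_threshold(raw_threshold)
    if threshold is None:  # mypy narrows Optional[float] to float below
        return value
    return min(value, threshold)
```

Running `mypy` on a file like this flags any caller that uses `parse_threshold`'s result without first checking for `None`.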
10. Pre-commit Documentation --- pre-commit.com
The official guide to the pre-commit framework. Covers hook configuration, supported languages (Python, Rust, Go, shell scripts), custom hooks, CI integration, and the hook lifecycle. The "Supported Hooks" page lists all available hooks from the community, including security scanners, secret detectors, YAML validators, and large-file blockers.
Technical Debt in ML Systems
11. "Hidden Technical Debt in Machine Learning Systems" --- Sculley et al. (Google, NeurIPS 2015)
The foundational paper on ML technical debt. Its central argument --- that the ML code is a tiny fraction of a real-world ML system, surrounded by vastly larger data collection, feature extraction, configuration, monitoring, and serving infrastructure --- has shaped how the industry thinks about ML engineering. The paper introduces concepts now standard in the field: entangled features, hidden feedback loops, undeclared consumers, data dependency debt, and configuration debt. Read this paper. It is 9 pages and every page is actionable.
12. "Machine Learning: The High-Interest Credit Card of Technical Debt" --- Sculley et al. (Google, SE4ML Workshop 2014)
The precursor to the NeurIPS paper, shorter and more focused on the financial metaphor. The key insight: ML technical debt compounds faster than traditional software debt because changes to data, features, or upstream systems can silently degrade model performance without changing any code. In traditional software, a bug causes a crash. In ML systems, a bug causes slightly worse predictions that nobody notices for months.
13. "Rules of Machine Learning: Best Practices for ML Engineering" --- Martin Zinkevich (Google, 2017)
A list of 43 rules distilled from Google's experience building and maintaining ML systems at scale. Rule 1 ("Don't be afraid to launch a product without machine learning") and Rule 4 ("Keep the first model simple and get the infrastructure right") are particularly relevant to this chapter's argument that engineering infrastructure matters more than model sophistication. Available free at developers.google.com/machine-learning/guides/rules-of-ml.
Refactoring and Software Design
14. Refactoring: Improving the Design of Existing Code --- Martin Fowler (2nd edition, 2018, Addison-Wesley)
The definitive reference on refactoring. The catalog of refactoring patterns --- Extract Function, Inline Function, Replace Temp with Query, Replace Conditional with Polymorphism --- applies directly to data science code. The Extract Function pattern alone (identifying a block of code, giving it a name, and moving it to a function) covers 80% of the refactoring work in this chapter. The second edition uses JavaScript examples, but the principles are language-agnostic.
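Extract Function in the data science setting, as a before-and-after sketch (the cleaning steps are illustrative, not from Fowler's catalog):

```python
import pandas as pd

# Before: anonymous cleaning logic buried inline in a script:
#   df["age"] = df["age"].fillna(df["age"].median())
#   df = df[df["age"].between(0, 120)]

# After: the same block extracted, named, and therefore testable.
def clean_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing ages with the median and drop impossible values."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["age"] = out["age"].fillna(out["age"].median())
    return out[out["age"].between(0, 120)]
```

The extracted function can now be unit tested, reused across notebooks, and documented, none of which was possible while the logic lived inline.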
15. Clean Code: A Handbook of Agile Software Craftsmanship --- Robert C. Martin (2008, Prentice Hall)
The principles in this book --- meaningful names, small functions, one level of abstraction per function, no side effects, DRY --- are as relevant to data science code as to web application code. Chapter 3 (Functions) and Chapter 9 (Unit Tests) are the most directly applicable. The book's examples are in Java, but the principles transfer to Python without modification.
16. The Pragmatic Programmer --- David Thomas and Andrew Hunt (20th Anniversary Edition, 2019, Addison-Wesley)
A broader take on software craftsmanship that covers version control philosophy, the DRY principle (which the authors coined), the concept of "tracer bullets" (thin end-to-end implementations before full feature builds), and the importance of automation. The chapter on "Pragmatic Paranoia" --- writing code that assumes your own code is wrong --- maps directly to the defensive testing practices recommended in this chapter.
Version Control for Data Science
17. Pro Git --- Scott Chacon and Ben Straub (2nd edition, 2014, Apress)
The standard reference on git, freely available at git-scm.com/book. Chapters 3 (branching) and 8 (customizing git, including hooks and attributes) are directly relevant. The section on git attributes and clean/smudge filters explains the mechanism that nbstripout uses to strip notebook outputs. The section on git filter-branch covers how to remove large files from repository history (as Marcus's team needed in Case Study 2).
18. DVC Documentation --- dvc.org/doc
Data Version Control (DVC) extends git to handle data files and model artifacts. Instead of committing large files to git, DVC stores them in remote storage (S3, GCS, Azure) and commits only small .dvc metafiles that reference them. The documentation covers pipeline definition, experiment tracking, and metric comparison --- features that complement the project structure and testing practices in this chapter.
19. "Jupyter Notebook Best Practices for Data Science" --- Jonathan Whitmore (svds.com, 2016)
An early and still-relevant set of guidelines for notebook hygiene: naming conventions, cell ordering, the separation of exploration from production code, and the use of nbconvert for generating reports. The central argument --- that notebooks should be the place where you tell a story, not the place where you build a system --- aligns with this chapter's recommendation to treat notebooks as a visualization layer that imports from src/.
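The "visualization layer" pattern reduces each notebook cell to a thin call into tested library code. A hedged sketch (the module and function names are hypothetical):

```python
# src/features.py --- tested, importable pipeline code (hypothetical module)
import pandas as pd

def add_rolling_mean(df: pd.DataFrame, col: str, window: int = 7) -> pd.DataFrame:
    """Append a rolling-mean column; lives in src/ so it can be unit tested."""
    out = df.copy()
    out[f"{col}_rolling_{window}"] = out[col].rolling(window, min_periods=1).mean()
    return out

# In a notebook cell, the call site stays to a couple of lines:
#   from src.features import add_rolling_mean
#   add_rolling_mean(sales, "revenue").plot(y="revenue_rolling_7")
```

The notebook tells the story (load, call, plot); the logic worth testing lives in src/ where pytest and the linters can reach it.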
Applied Software Engineering for Data Teams
20. Software Engineering for Data Scientists --- Catherine Nelson (2023, O'Reilly)
A book-length treatment of the topics in this chapter, written specifically for data scientists transitioning to production codebases. Covers project structure, testing, CI/CD, code review, and collaboration patterns. The case studies are drawn from real data science teams and include the kinds of messy, pragmatic compromises that production systems require.
21. Designing Machine Learning Systems --- Chip Huyen (2022, O'Reilly)
A comprehensive guide to building ML systems from data engineering to production monitoring. Chapter 2 (Introduction to Machine Learning Systems Design) and Chapter 11 (Infrastructure and Tooling) provide the broader systems context for this chapter's software engineering practices. Huyen's treatment of data iteration (changing data is faster and more impactful than changing models) reinforces the importance of well-engineered data pipelines.