

Chapter 29: Software Engineering for Data Scientists

Version Control, Testing, Code Quality, and Technical Debt


Learning Objectives

By the end of this chapter, you will be able to:

  1. Structure a data science project with cookiecutter-data-science conventions
  2. Write unit tests and integration tests for data pipelines
  3. Apply code quality tools (black, ruff, mypy) to data science code
  4. Refactor notebook code into importable Python modules
  5. Manage technical debt in ML systems

The Notebook Graveyard

War Story --- A senior data scientist at a Series C fintech startup was asked to onboard a new hire. "Here's the repo," she said, and pointed to a shared drive containing 47 notebooks. The new hire opened the first one: churn_model_final.ipynb. Then churn_model_final_v2.ipynb. Then churn_model_ACTUALLY_FINAL_v3.ipynb. Then churn_model_ACTUALLY_FINAL_v3_fixed.ipynb. Nobody could remember which notebook produced the model currently running in production. The senior data scientist had written most of them, and even she was not sure. The onboarding took three weeks instead of three days. Two of those weeks were spent reverse-engineering which cells to run in which order.

This is not an exceptional story. It is the default state of most data science teams. Notebooks are wonderful for exploration. They are terrible for production, collaboration, and maintenance. The problem is not the notebook itself --- it is that notebooks encourage a style of coding that violates every principle software engineers learned the hard way over the past five decades: no modularity, no tests, hidden state from out-of-order execution, copy-pasted functions, hardcoded paths, and implicit dependencies that only work on one person's machine.

This chapter teaches you the engineering practices that transform notebook experiments into maintainable systems. You are not a data scientist writing software. You are a software engineer building data science systems. Act accordingly.

We cover five topics, each one a layer in the stack of professional data science code:

  1. Project structure --- standardized directory layouts that any data scientist can navigate
  2. Version control --- git branching strategies for ML projects
  3. Testing --- unit tests and integration tests for data pipelines
  4. Code quality --- automated formatting, linting, and type checking
  5. Technical debt --- the hidden costs that accumulate in ML systems

The running example is StreamFlow's churn prediction project. By the end of this chapter, the notebook from Chapters 1--19 will be a proper Python package with tests, type hints, and a reproducible entry point.


Project Structure: The Cookiecutter Convention

The first decision in any software project is where things go. In data science, this decision is usually deferred until it is too late, resulting in a flat directory of notebooks, CSVs, pickle files, and a README that says "TODO."

The cookiecutter-data-science template provides a standardized answer. It was created by DrivenData and has become the de facto convention for Python-based data science projects. You do not need to use the cookiecutter tool itself --- the structure is what matters.

# Install and generate a new project
pip install cookiecutter
cookiecutter https://github.com/drivendata/cookiecutter-data-science

The resulting layout looks like this:

streamflow-churn/
    README.md
    LICENSE
    Makefile                   # Top-level commands (make data, make train, etc.)
    pyproject.toml             # Project metadata and dependencies
    setup.cfg                  # Package configuration
    .env                       # Environment variables (never commit this)
    .gitignore
    data/
        raw/                   # Immutable original data
        interim/               # Intermediate transformations
        processed/             # Final feature matrices
        external/              # Third-party data sources
    models/                    # Trained model artifacts
    notebooks/                 # Jupyter notebooks (exploration only)
        01-eda.ipynb
        02-feature-engineering.ipynb
        03-modeling.ipynb
    references/                # Data dictionaries, papers, manuals
    reports/
        figures/               # Generated graphics for reporting
    src/
        __init__.py
        data/
            __init__.py
            make_dataset.py    # Download, clean, partition raw data
        features/
            __init__.py
            build_features.py  # Feature engineering logic
        models/
            __init__.py
            train_model.py     # Training script
            predict_model.py   # Prediction/inference script
        evaluation/
            __init__.py
            evaluate.py        # Metric computation
        visualization/
            __init__.py
            visualize.py       # Plotting functions
    tests/
        __init__.py
        test_data.py
        test_features.py
        test_models.py

Key Principle --- Data flows one direction: raw/ to interim/ to processed/. Raw data is immutable. Every transformation is code, not manual editing. If you delete everything except data/raw/ and src/, you can regenerate the entire project.

Why This Structure Works

The cookiecutter layout enforces three principles that notebooks violate:

Separation of concerns. Data loading, feature engineering, modeling, and evaluation live in separate modules. A change to your feature engineering does not require editing your training script. A bug in your evaluation metrics does not require re-running your entire pipeline.

Importability. Code in src/ can be imported by notebooks, scripts, tests, and other modules. Code in a notebook cell cannot be imported by anything. The moment you write a function you will use more than once, it belongs in src/, not in a notebook cell.

Reproducibility. The Makefile (or equivalent pyproject.toml scripts) defines the exact sequence of commands to reproduce every result. make data generates the processed dataset. make train trains the model. make test runs the test suite. A new team member can reproduce your results without reading your notebooks.

# pyproject.toml --- define project entry points
[project.scripts]
streamflow-data = "src.data.make_dataset:main"
streamflow-train = "src.models.train_model:main"
streamflow-predict = "src.models.predict_model:main"

Adapting the Structure

The cookiecutter template is a starting point, not a straitjacket. Common adaptations:

  • Add src/config/ for configuration files (hyperparameters, feature lists, data paths)
  • Add src/pipelines/ for orchestration code that chains data, features, and training
  • Replace Makefile with dvc (Data Version Control) for data pipeline DAGs
  • Add docker/ for containerization (Chapter 31)
  • Add mlruns/ or wandb/ for experiment tracking artifacts (Chapter 30)

The notebooks directory remains, but its role changes. Notebooks are for exploration, visualization, and communication. They import from src/ rather than containing production logic. A good test: if you deleted every notebook, could you still train and deploy the model? If yes, your structure is correct.
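The src/config/ adaptation from the list above can be as small as a single module of path constants derived from the project root. A minimal sketch, with module and constant names as assumptions:

```python
# src/config/paths.py -- hypothetical central path definitions.
# Modules import these constants instead of hardcoding path strings.
from pathlib import Path

# parents[2] walks up from src/config/paths.py to the project root
PROJECT_ROOT = Path(__file__).resolve().parents[2]

DATA_DIR = PROJECT_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
INTERIM_DATA_DIR = DATA_DIR / "interim"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
MODELS_DIR = PROJECT_ROOT / "models"
```

Hardcoded paths are one of the main reasons notebook code runs on only one machine; centralizing them makes the whole pipeline relocatable.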


Version Control: Git for ML Projects

You already know git. You know add, commit, push, pull, branch, merge. What you may not know is how to use git effectively in a machine learning context, where the artifacts are not just code but also data, models, and experiment configurations.

The .gitignore for Data Science

Your .gitignore is the first line of defense against repository bloat. Data science projects generate large binary files that do not belong in git.

# .gitignore for data science projects

# Data (use DVC or cloud storage)
data/raw/*
data/interim/*
data/processed/*
!data/raw/.gitkeep
!data/interim/.gitkeep
!data/processed/.gitkeep

# Model artifacts
models/*.pkl
models/*.joblib
models/*.h5
models/*.onnx

# Environment
.env
.venv/
*.egg-info/

# Notebooks --- track, but strip outputs
notebooks/.ipynb_checkpoints/

# IDE
.vscode/
.idea/

# OS
.DS_Store
Thumbs.db

# Experiment tracking
mlruns/
wandb/

Common Mistake --- Committing a 500 MB pickle file to git. Once a large file is in git history, it is there forever (even after deletion). Use Git LFS for large files you must version, or DVC for data and model versioning that integrates with cloud storage. Never commit data to git unless it is a small reference file (< 1 MB).

Branching Strategy for ML Projects

Software teams debate branching strategies endlessly. For data science teams, the following pattern works well:

main                  # Production-ready code. Always deployable.
  develop             # Integration branch. Merges from feature branches.
    feature/add-recency-features
    feature/try-lightgbm
    experiment/tune-xgb-lr
    bugfix/fix-null-handling
    data/update-2024q4

The key additions for ML projects are:

  • experiment/ branches for exploratory modeling work that may or may not be merged. These branches are cheap. Create them freely, delete them when the experiment concludes.
  • data/ branches for changes to data processing pipelines. Data changes can silently invalidate model performance, so they deserve their own review process.

# Create an experiment branch
git checkout -b experiment/tune-xgb-lr develop

# Work, commit, evaluate
# If results are good, merge to develop
git checkout develop
git merge experiment/tune-xgb-lr

# If results are bad, delete the branch
git branch -d experiment/tune-xgb-lr

Commit Messages That Help Future You

A commit history full of "fixed stuff" and "updated notebook" is worthless. Data science commits should communicate what changed and why.

# Bad
fixed stuff
updated model
changes

# Good
feat: add 30-day rolling engagement features to churn model
fix: handle null values in plan_change_date column
experiment: test LightGBM with dart boosting (AUC 0.847 -> 0.852)
data: update Q4 2024 event logs with corrected timezone handling
refactor: extract feature engineering from notebook into src/features/

The experiment: prefix is particularly useful. It records the result in the commit message itself, creating a searchable log of what you tried and what worked.

Notebook Version Control

Notebooks are JSON files that contain code, outputs, and metadata. Diffs are unreadable, outputs inflate repository size, and merge conflicts are nearly impossible to resolve.

Three approaches:

1. Strip outputs before committing. Use nbstripout as a pre-commit hook.

pip install nbstripout
nbstripout --install  # Installs git filter to strip outputs on commit

2. Pair notebooks with scripts. Use jupytext to maintain a .py version alongside each .ipynb. The .py file diffs cleanly and can be reviewed in pull requests.

pip install jupytext
jupytext --set-formats ipynb,py:percent notebooks/01-eda.ipynb

3. Treat notebooks as disposable. The real code lives in src/. Notebooks import from src/ and exist only for exploration. If a notebook is lost, nothing of value is lost, because all logic is in version-controlled Python modules.

The third approach is the correct one for production projects.


Testing Data Science Code

The uncomfortable truth: most data scientists have never written a test. Not a unit test, not an integration test, not a smoke test. The model evaluation metrics they compute --- accuracy, AUC, RMSE --- are not tests. They are measurements. A test has an expected outcome. A test passes or fails. A test runs automatically, every time you change the code.

Why Data Science Code Needs Tests

"But my code is not a web app. It is a pipeline that transforms data and trains a model. What would I even test?"

You test that your code does what you think it does. Specifically:

  • Data loading functions return the expected schema (column names, dtypes, no unexpected nulls)
  • Feature engineering functions produce correct values for known inputs
  • Preprocessing pipelines handle edge cases (empty DataFrames, single-row DataFrames, all-null columns)
  • Model training produces a model that can make predictions (smoke test)
  • Evaluation functions compute correct metrics for hand-calculated examples

Unit Tests with pytest

A unit test tests a single function in isolation, with controlled inputs and a known expected output.

# tests/test_features.py

import pandas as pd
import numpy as np
import pytest
from src.features.build_features import (
    compute_recency_features,
    compute_engagement_rate,
    compute_plan_change_flag,
)


class TestComputeRecencyFeatures:
    """Tests for the recency feature engineering function."""

    def test_days_since_last_event(self):
        """Recency should be the difference between reference date and last event."""
        events = pd.DataFrame({
            'user_id': [1, 1, 2, 2],
            'event_date': pd.to_datetime([
                '2024-10-01', '2024-10-15',
                '2024-10-05', '2024-10-20',
            ]),
        })
        reference_date = pd.Timestamp('2024-10-31')
        result = compute_recency_features(events, reference_date)

        assert result.loc[result['user_id'] == 1, 'days_since_last_event'].iloc[0] == 16
        assert result.loc[result['user_id'] == 2, 'days_since_last_event'].iloc[0] == 11

    def test_user_with_no_events_returns_nan(self):
        """Users present in the user table but absent from events should get NaN."""
        events = pd.DataFrame({
            'user_id': pd.Series([], dtype='int64'),
            'event_date': pd.Series([], dtype='datetime64[ns]'),
        })
        reference_date = pd.Timestamp('2024-10-31')
        result = compute_recency_features(events, reference_date)

        assert len(result) == 0

    def test_single_event_user(self):
        """A user with exactly one event should have correct recency."""
        events = pd.DataFrame({
            'user_id': [1],
            'event_date': pd.to_datetime(['2024-10-25']),
        })
        reference_date = pd.Timestamp('2024-10-31')
        result = compute_recency_features(events, reference_date)

        assert result['days_since_last_event'].iloc[0] == 6


class TestComputeEngagementRate:
    """Tests for the engagement rate calculation."""

    def test_basic_engagement_rate(self):
        """Engagement rate = active_days / total_days."""
        result = compute_engagement_rate(active_days=15, total_days=30)
        assert result == pytest.approx(0.5)

    def test_zero_active_days(self):
        """Zero active days should return 0.0, not raise an error."""
        result = compute_engagement_rate(active_days=0, total_days=30)
        assert result == 0.0

    def test_zero_total_days_raises_error(self):
        """Division by zero should raise ValueError, not return inf."""
        with pytest.raises(ValueError, match="total_days must be positive"):
            compute_engagement_rate(active_days=5, total_days=0)

    def test_engagement_rate_capped_at_one(self):
        """If active_days > total_days (data error), cap at 1.0."""
        result = compute_engagement_rate(active_days=35, total_days=30)
        assert result == 1.0

Run tests with:

# Run all tests
pytest tests/ -v

# Run tests for a specific module
pytest tests/test_features.py -v

# Run tests matching a keyword
pytest tests/ -v -k "engagement"

Example output:

tests/test_features.py::TestComputeRecencyFeatures::test_days_since_last_event PASSED
tests/test_features.py::TestComputeRecencyFeatures::test_user_with_no_events_returns_nan PASSED
tests/test_features.py::TestComputeRecencyFeatures::test_single_event_user PASSED
tests/test_features.py::TestComputeEngagementRate::test_basic_engagement_rate PASSED
tests/test_features.py::TestComputeEngagementRate::test_zero_active_days PASSED
tests/test_features.py::TestComputeEngagementRate::test_zero_total_days_raises_error PASSED
tests/test_features.py::TestComputeEngagementRate::test_engagement_rate_capped_at_one PASSED

7 passed in 0.34s

Fixtures: Reusable Test Data

When multiple tests need the same input data, use pytest fixtures instead of copy-pasting DataFrames.

# tests/conftest.py

import pandas as pd
import numpy as np
import pytest


@pytest.fixture
def sample_events():
    """A small but realistic event DataFrame for testing."""
    np.random.seed(42)
    return pd.DataFrame({
        'user_id': [1, 1, 1, 2, 2, 3],
        'event_type': ['page_view', 'video_start', 'video_complete',
                       'page_view', 'search', 'page_view'],
        'timestamp': pd.to_datetime([
            '2024-10-01 08:00', '2024-10-01 08:15', '2024-10-01 08:45',
            '2024-10-02 12:00', '2024-10-02 12:30', '2024-10-03 18:00',
        ]),
        'duration_seconds': [30, 120, 0, 45, 10, 22],
        'plan_type': ['premium', 'premium', 'premium',
                      'basic', 'basic', 'standard'],
    })


@pytest.fixture
def sample_subscribers():
    """A small subscriber table for testing."""
    return pd.DataFrame({
        'user_id': [1, 2, 3, 4],
        'signup_date': pd.to_datetime([
            '2023-01-15', '2023-06-01', '2024-01-01', '2024-03-15'
        ]),
        'plan_type': ['premium', 'basic', 'standard', 'basic'],
        'monthly_revenue': [14.99, 9.99, 12.99, 9.99],
        'churned': [0, 1, 0, 1],
    })


@pytest.fixture
def trained_model(sample_events, sample_subscribers):
    """A fitted model for smoke testing prediction functions."""
    from sklearn.ensemble import GradientBoostingClassifier
    from src.features.build_features import build_feature_matrix

    X, y = build_feature_matrix(sample_events, sample_subscribers)
    model = GradientBoostingClassifier(n_estimators=10, random_state=42)
    model.fit(X, y)
    return model

Fixtures are injected by name. Any test function that includes sample_events as a parameter will receive the DataFrame automatically.

# tests/test_data.py

def test_event_schema(sample_events):
    """Event DataFrame should have the expected columns and dtypes."""
    expected_columns = ['user_id', 'event_type', 'timestamp',
                        'duration_seconds', 'plan_type']
    assert list(sample_events.columns) == expected_columns
    assert sample_events['user_id'].dtype in ['int64', 'int32']
    assert pd.api.types.is_datetime64_any_dtype(sample_events['timestamp'])


def test_no_null_user_ids(sample_events):
    """user_id should never be null --- it is the primary key."""
    assert sample_events['user_id'].notna().all()
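Schema and null checks extend naturally to value ranges. A sketch of a range-check test; for self-containment it builds its DataFrame inline rather than using the fixture, and the column values follow the fixtures above:

```python
# tests/test_data.py -- range checks for event data (sketch)
import pandas as pd


def test_value_ranges():
    """Durations should be non-negative; plan types come from a closed set."""
    events = pd.DataFrame({
        "duration_seconds": [30, 120, 0, 45],
        "plan_type": ["premium", "basic", "standard", "basic"],
    })
    # Durations can be zero (instant events) but never negative
    assert (events["duration_seconds"] >= 0).all()
    # Categorical columns should be a subset of a known vocabulary
    assert set(events["plan_type"]) <= {"basic", "standard", "premium"}
```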

Parametrize: Testing Multiple Inputs

When you want to test the same function with many different inputs, use @pytest.mark.parametrize instead of writing a separate test for each case.

# tests/test_features.py

@pytest.mark.parametrize("active_days, total_days, expected", [
    (0, 30, 0.0),
    (15, 30, 0.5),
    (30, 30, 1.0),
    (1, 365, pytest.approx(0.00274, abs=1e-4)),
    (35, 30, 1.0),  # Capped at 1.0
])
def test_engagement_rate_parametrized(active_days, total_days, expected):
    """Engagement rate should handle a range of input combinations."""
    from src.features.build_features import compute_engagement_rate
    result = compute_engagement_rate(active_days, total_days)
    assert result == expected

Integration Tests for Pipelines

A unit test tests one function. An integration test tests multiple functions working together --- an entire pipeline from raw data to output.

# tests/test_pipeline.py

import pandas as pd
import numpy as np
import pytest
from src.data.make_dataset import load_and_clean
from src.features.build_features import build_feature_matrix
from src.models.train_model import train_churn_model
from src.evaluation.evaluate import compute_metrics


class TestEndToEndPipeline:
    """Integration tests for the full churn prediction pipeline."""

    def test_pipeline_produces_predictions(self, sample_events, sample_subscribers):
        """The full pipeline should run without errors and produce predictions."""
        # Step 1: Clean data
        cleaned_events = load_and_clean(sample_events)
        assert len(cleaned_events) > 0

        # Step 2: Build features
        X, y = build_feature_matrix(cleaned_events, sample_subscribers)
        assert X.shape[0] == y.shape[0]
        assert X.shape[1] > 0
        assert not X.isnull().any().any(), "Feature matrix contains NaN values"

        # Step 3: Train model
        model = train_churn_model(X, y)
        assert hasattr(model, 'predict')
        assert hasattr(model, 'predict_proba')

        # Step 4: Evaluate
        predictions = model.predict(X)
        probabilities = model.predict_proba(X)[:, 1]
        metrics = compute_metrics(y, predictions, probabilities)
        assert 'auc' in metrics
        assert 0.0 <= metrics['auc'] <= 1.0

    def test_pipeline_handles_single_class(self, sample_events):
        """Pipeline should not crash when all subscribers have the same label."""
        subscribers = pd.DataFrame({
            'user_id': [1, 2, 3],
            'signup_date': pd.to_datetime(['2023-01-15'] * 3),
            'plan_type': ['basic'] * 3,
            'monthly_revenue': [9.99] * 3,
            'churned': [0, 0, 0],  # No positive class
        })
        cleaned_events = load_and_clean(sample_events)
        X, y = build_feature_matrix(cleaned_events, subscribers)

        # Should either raise a clear error or handle gracefully
        with pytest.raises(ValueError, match="single class"):
            train_churn_model(X, y)

    def test_output_schema_stability(self, sample_events, sample_subscribers):
        """Feature matrix columns should be deterministic across runs."""
        cleaned = load_and_clean(sample_events)
        X1, _ = build_feature_matrix(cleaned, sample_subscribers)
        X2, _ = build_feature_matrix(cleaned, sample_subscribers)

        assert list(X1.columns) == list(X2.columns)
        assert X1.shape == X2.shape

What Not to Test

Testing is valuable, but not everything deserves a test. Use your judgment:

Worth testing:
  • Feature engineering logic
  • Data validation (schema, nulls, ranges)
  • Custom metrics and loss functions
  • Pipeline end-to-end smoke tests
  • Edge cases (empty data, single row, all nulls)

Not worth testing:
  • Notebook visualizations
  • Exploratory analysis code
  • Third-party library internals
  • Model accuracy (that is evaluation, not testing)
  • Exact floating-point outputs (use pytest.approx instead)

Recurring Theme: Real World =/= Kaggle --- In a Kaggle competition, you have one dataset, one target, and one leaderboard. Testing is irrelevant because the competition ends and the code is discarded. In production, your pipeline runs on new data every day, your features evolve, your team grows, and your code must survive contact with reality. Tests are the insurance policy that lets you change code with confidence.


Code Quality: Automated Formatting, Linting, and Type Checking

You have structured your project. You have written tests. Now make the code itself readable, consistent, and correct.

Three tools form the modern Python code quality stack:

Black: The Uncompromising Formatter

Black reformats your code to a single, consistent style. There are no configuration options to argue about (well, almost none). The team uses black, and all code looks the same.

pip install black

# Format a single file
black src/features/build_features.py

# Format the entire project
black src/ tests/

# Check without modifying (useful in CI)
black --check src/ tests/

Before black:

def build_feature_matrix(events,subscribers,reference_date=None,
    include_recency=True,include_frequency=True,
                         include_monetary = True):
    if reference_date is None: reference_date=events['timestamp'].max()
    features=[]
    if include_recency:
        features.append(compute_recency_features(events,reference_date))
    if include_frequency:features.append(compute_frequency_features(events,reference_date))
    if include_monetary: features.append(  compute_monetary_features(events, reference_date))
    return pd.concat(features,axis=1)

After black:

def build_feature_matrix(
    events,
    subscribers,
    reference_date=None,
    include_recency=True,
    include_frequency=True,
    include_monetary=True,
):
    if reference_date is None:
        reference_date = events["timestamp"].max()
    features = []
    if include_recency:
        features.append(compute_recency_features(events, reference_date))
    if include_frequency:
        features.append(compute_frequency_features(events, reference_date))
    if include_monetary:
        features.append(compute_monetary_features(events, reference_date))
    return pd.concat(features, axis=1)

The second version is not "better" by aesthetic preference. It is better because every function in the codebase follows the same formatting rules, which means diffs are smaller, code reviews focus on logic instead of style, and new team members can read the code without adapting to individual formatting preferences.

Ruff: The Fast Linter

Ruff is a Python linter written in Rust. It replaces flake8, isort, pyflakes, and dozens of other tools with a single, fast binary. It catches bugs, enforces style rules, and auto-fixes many issues.

pip install ruff

# Lint the project
ruff check src/ tests/

# Auto-fix what can be fixed
ruff check --fix src/ tests/

# Sort imports
ruff check --select I --fix src/ tests/

Common issues ruff catches in data science code:

# F841: Local variable 'df' is assigned but never used
df = pd.read_csv('data.csv')
result = pd.read_csv('other_data.csv')  # Oops, meant to use df

# E711: Comparison to None (use 'is None', not '== None')
if value == None:  # Wrong
    ...
if value is None:  # Correct
    ...

# F401: Imported but unused
import numpy as np
import scipy  # Never used --- ruff flags this

# W291: Trailing whitespace
# W292: No newline at end of file
# E501: Line too long (configurable)

Configure ruff in pyproject.toml:

[tool.ruff]
line-length = 100  # Match black's line length if you changed it
target-version = "py310"

[tool.ruff.lint]
select = [
    "E",    # pycodestyle errors
    "W",    # pycodestyle warnings
    "F",    # pyflakes
    "I",    # isort (import sorting)
    "N",    # pep8-naming
    "UP",   # pyupgrade
    "B",    # bugbear (common bugs)
    "SIM",  # simplify
]
ignore = ["E501"]  # Let black handle line length

[tool.ruff.lint.per-file-ignores]
"notebooks/*" = ["E402"]  # Allow imports not at top in notebooks

Mypy: Static Type Checking

Mypy checks type annotations without running the code. Type hints do not affect runtime behavior, but they catch entire categories of bugs at development time and serve as living documentation.

pip install mypy

# Type-check the source code
mypy src/ --ignore-missing-imports

Adding type hints to data science code:

# src/features/build_features.py

import pandas as pd
from typing import Optional


def compute_engagement_rate(active_days: int, total_days: int) -> float:
    """Compute the engagement rate as active_days / total_days.

    Args:
        active_days: Number of days with at least one event.
        total_days: Total number of days in the observation window.

    Returns:
        Engagement rate, capped at 1.0.

    Raises:
        ValueError: If total_days is not positive.
    """
    if total_days <= 0:
        raise ValueError("total_days must be positive")
    return min(active_days / total_days, 1.0)


def compute_recency_features(
    events: pd.DataFrame,
    reference_date: pd.Timestamp,
) -> pd.DataFrame:
    """Compute days since last event per user.

    Args:
        events: DataFrame with columns ['user_id', 'event_date'].
        reference_date: The date from which to measure recency.

    Returns:
        DataFrame with columns ['user_id', 'days_since_last_event'].
    """
    last_event = events.groupby("user_id")["event_date"].max().reset_index()
    last_event["days_since_last_event"] = (
        reference_date - last_event["event_date"]
    ).dt.days
    return last_event[["user_id", "days_since_last_event"]]


def build_feature_matrix(
    events: pd.DataFrame,
    subscribers: pd.DataFrame,
    reference_date: Optional[pd.Timestamp] = None,
) -> tuple[pd.DataFrame, pd.Series]:
    """Build the full feature matrix for churn prediction.

    Args:
        events: Raw event log DataFrame.
        subscribers: Subscriber metadata DataFrame.
        reference_date: Observation cutoff date. Defaults to max event timestamp.

    Returns:
        Tuple of (feature_matrix, target_series).
    """
    if reference_date is None:
        reference_date = events["timestamp"].max()
    # ... implementation

Pragmatic Advice --- Do not try to add type hints to your entire codebase overnight. Start with function signatures for src/ modules. Skip notebooks entirely. Use # type: ignore for complex pandas operations where mypy's understanding of DataFrame types is limited. The goal is gradual improvement, not perfection.
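A sketch of that gradual approach in practice. The function below is a made-up example, not part of the StreamFlow code; the point is the annotated signature plus a targeted ignore on a pandas expression:

```python
# Gradual typing sketch: annotate the signature, silence mypy only on the
# specific pandas expression its stubs cannot follow.
# `top_plans` is a hypothetical helper for illustration.
import pandas as pd


def top_plans(subscribers: pd.DataFrame, n: int = 3) -> list[str]:
    """Return the n most common plan types."""
    counts = subscribers["plan_type"].value_counts()
    return counts.head(n).index.tolist()  # type: ignore
```

The signature alone already documents the contract (DataFrame in, list of strings out), which is most of the value; the ignore comment is scoped to the one line mypy cannot verify.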

Pre-Commit Hooks: Automate Everything

Pre-commit hooks run checks automatically before every git commit. If any check fails, the commit is rejected, and you must fix the issue before committing. This prevents bad code from ever entering the repository.

pip install pre-commit

Create .pre-commit-config.yaml in the project root:

# .pre-commit-config.yaml

repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff
        args: [--fix]

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0
    hooks:
      - id: mypy
        additional_dependencies: [pandas-stubs, types-requests]
        args: [--ignore-missing-imports]

  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
        args: ['--maxkb=500']

# Install the hooks
pre-commit install

# Run manually on all files (first time)
pre-commit run --all-files

Now every commit is automatically formatted, linted, type-checked, and stripped of notebook outputs. The team cannot commit unformatted code, unused imports, or 500 MB pickle files.


Refactoring: From Notebook to Module

Refactoring is the act of restructuring existing code without changing its behavior. In data science, the most common refactoring is extracting logic from a notebook into importable Python modules.

The Extraction Pattern

The pattern is mechanical:

  1. Identify a logical unit of work in the notebook (e.g., "all the cells that compute recency features")
  2. Extract those cells into a function with explicit inputs and outputs
  3. Move the function to the appropriate module in src/
  4. Replace the notebook cells with an import and a function call
  5. Write a test for the extracted function

Before (in notebook):

# Cell 47: Compute recency features
last_event_dates = events_df.groupby('user_id')['timestamp'].max().reset_index()
last_event_dates.columns = ['user_id', 'last_event_date']
last_event_dates['days_since_last_event'] = (
    pd.Timestamp('2024-10-31') - last_event_dates['last_event_date']
).dt.days
last_event_dates['days_since_last_video'] = ...  # 15 more lines
last_event_dates['days_since_last_search'] = ...  # 15 more lines

# Cell 48: Merge with subscriber table
features = subscribers_df.merge(last_event_dates, on='user_id', how='left')

After (in src/features/build_features.py):

def compute_recency_features(
    events: pd.DataFrame,
    reference_date: pd.Timestamp,
) -> pd.DataFrame:
    """Compute recency features per user.

    Calculates days since last event overall, last video event,
    and last search event relative to the reference date.

    Args:
        events: Event log with columns ['user_id', 'timestamp', 'event_type'].
        reference_date: The cutoff date for recency calculation.

    Returns:
        DataFrame with columns ['user_id', 'days_since_last_event',
        'days_since_last_video', 'days_since_last_search'].
    """
    last_overall = (
        events.groupby("user_id")["timestamp"]
        .max()
        .reset_index()
        .rename(columns={"timestamp": "last_event_date"})
    )
    last_overall["days_since_last_event"] = (
        reference_date - last_overall["last_event_date"]
    ).dt.days

    last_video = (
        events[events["event_type"].isin(["video_start", "video_complete"])]
        .groupby("user_id")["timestamp"]
        .max()
        .reset_index()
        .rename(columns={"timestamp": "last_video_date"})
    )
    last_video["days_since_last_video"] = (
        reference_date - last_video["last_video_date"]
    ).dt.days

    last_search = (
        events[events["event_type"] == "search"]
        .groupby("user_id")["timestamp"]
        .max()
        .reset_index()
        .rename(columns={"timestamp": "last_search_date"})
    )
    last_search["days_since_last_search"] = (
        reference_date - last_search["last_search_date"]
    ).dt.days

    result = last_overall[["user_id", "days_since_last_event"]]
    result = result.merge(
        last_video[["user_id", "days_since_last_video"]], on="user_id", how="left"
    )
    result = result.merge(
        last_search[["user_id", "days_since_last_search"]], on="user_id", how="left"
    )
    return result

After (in notebook):

# Cell 47: Compute recency features
from src.features.build_features import compute_recency_features

recency_features = compute_recency_features(
    events_df,
    reference_date=pd.Timestamp('2024-10-31'),
)
recency_features.head()

The function has explicit inputs (events DataFrame, reference date), explicit outputs (recency DataFrame), no hidden state, and a docstring that explains what it does. It can be tested, reused, and modified without opening a notebook.
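Step 5 of the extraction pattern, the test, might look like the sketch below. In the project it would live in tests/test_features.py and import the real function from src.features.build_features; a condensed stand-in is defined here so the sketch runs on its own, and the fixture values are illustrative.

```python
import pandas as pd

# In the project this would be:
#   from src.features.build_features import compute_recency_features
# A condensed stand-in is defined here so the sketch runs on its own.
def compute_recency_features(events, reference_date):
    def days_since(subset, name):
        last = subset.groupby("user_id")["timestamp"].max()
        return (reference_date - last).dt.days.rename(name)

    return pd.concat(
        [
            days_since(events, "days_since_last_event"),
            days_since(
                events[events["event_type"].isin(["video_start", "video_complete"])],
                "days_since_last_video",
            ),
            days_since(events[events["event_type"] == "search"],
                       "days_since_last_search"),
        ],
        axis=1,
    ).reset_index()


def test_recency_is_computed_per_user():
    # Tiny hand-built fixture: two users, three events
    events = pd.DataFrame({
        "user_id": [1, 1, 2],
        "timestamp": pd.to_datetime(["2024-10-01", "2024-10-21", "2024-10-26"]),
        "event_type": ["search", "video_start", "search"],
    })
    result = compute_recency_features(events, pd.Timestamp("2024-10-31"))
    row = result.set_index("user_id").loc[1]
    assert row["days_since_last_event"] == 10   # last event: 2024-10-21
    assert row["days_since_last_video"] == 10   # video_start on 2024-10-21
    assert row["days_since_last_search"] == 30  # search on 2024-10-01
```

Because the function takes the reference date as a parameter instead of hardcoding it, the test can pin the date and assert exact values.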

The DRY Principle

DRY stands for "Don't Repeat Yourself." If the same logic appears in two places, it will eventually diverge --- one copy will be updated, the other will not, and you will spend hours debugging a discrepancy.

The most common DRY violation in data science: copy-pasting feature engineering code between training and prediction pipelines.

# BAD: Feature engineering duplicated in two places

# In train_model.py:
df['tenure_months'] = (pd.Timestamp('2024-10-31') - df['signup_date']).dt.days / 30
df['log_revenue'] = np.log1p(df['monthly_revenue'])
df['engagement_rate'] = df['active_days'] / df['total_days']

# In predict_model.py (copy-pasted, subtly different):
df['tenure_months'] = (pd.Timestamp.now() - df['signup_date']).dt.days / 30  # Different date!
df['log_revenue'] = np.log(df['monthly_revenue'])  # log instead of log1p!
df['engagement_rate'] = df['active_days'] / df['total_days']

The training pipeline uses a fixed reference date; the prediction pipeline uses now(). The training pipeline uses log1p; the prediction pipeline uses log. These discrepancies produce a training-serving skew that silently degrades model performance.

# GOOD: Single source of truth

# In src/features/build_features.py:
def compute_all_features(
    df: pd.DataFrame,
    reference_date: pd.Timestamp,
) -> pd.DataFrame:
    """Compute all features for churn prediction.

    This function is the SINGLE SOURCE OF TRUTH for feature engineering.
    It is called by both the training pipeline and the prediction pipeline.
    """
    result = df.copy()
    result["tenure_months"] = (reference_date - result["signup_date"]).dt.days / 30
    result["log_revenue"] = np.log1p(result["monthly_revenue"])
    result["engagement_rate"] = result["active_days"] / result["total_days"]
    return result

Both training and prediction import and call the same function. If the feature logic changes, it changes in one place.
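In practice the two entry points just import and call it. A runnable sketch (the function body is repeated here as a stand-in so the snippet is self-contained; in the project both scripts would import it from src.features.build_features, and the subscriber row is illustrative):

```python
import numpy as np
import pandas as pd


# Stand-in for the shared function in src/features/build_features.py.
def compute_all_features(df, reference_date):
    result = df.copy()
    result["tenure_months"] = (reference_date - result["signup_date"]).dt.days / 30
    result["log_revenue"] = np.log1p(result["monthly_revenue"])
    result["engagement_rate"] = result["active_days"] / result["total_days"]
    return result


subscribers = pd.DataFrame({
    "signup_date": pd.to_datetime(["2024-04-30"]),
    "monthly_revenue": [19.99],
    "active_days": [12],
    "total_days": [30],
})

# train_model.py: reference date pinned to the label snapshot
train_features = compute_all_features(subscribers, pd.Timestamp("2024-10-31"))

# predict_model.py: "now" instead, but through the exact same code path,
# so the only training-serving difference is the reference date itself
live_features = compute_all_features(subscribers, pd.Timestamp.now().normalize())
```

The reference date is the one legitimate difference between the two pipelines, so it is the one thing passed in as an argument; everything else is shared by construction.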

The __init__.py Convention

Each subdirectory in src/ needs an __init__.py file to be importable as a regular Python package. These files can be empty, but they are more useful when they expose a clean public API.

# src/features/__init__.py

from src.features.build_features import (
    build_feature_matrix,
    compute_recency_features,
    compute_frequency_features,
    compute_monetary_features,
    compute_engagement_rate,
)

__all__ = [
    "build_feature_matrix",
    "compute_recency_features",
    "compute_frequency_features",
    "compute_monetary_features",
    "compute_engagement_rate",
]

Now other modules can import directly:

from src.features import build_feature_matrix

Technical Debt in Machine Learning Systems

Key Insight --- Google's 2015 paper "Hidden Technical Debt in Machine Learning Systems" argues that the actual ML code --- the model training and prediction --- is a tiny fraction of a real-world ML system. The surrounding infrastructure (data collection, feature extraction, configuration, monitoring, serving) is vastly larger and harder to maintain. Technical debt accumulates not just in the code, but in the data, the features, the configuration, and the feedback loops.

What Is Technical Debt?

Technical debt is the accumulated cost of shortcuts, quick fixes, and deferred maintenance in a codebase. Like financial debt, it compounds: a shortcut today makes tomorrow's change harder, which makes the next change harder still, until the system is so fragile that any modification risks breaking it.

In traditional software, technical debt is mostly about code. In ML systems, it takes additional forms:

Code-Level Debt

The forms you might expect:

  • Dead code. Functions, features, and model variants that are no longer used but still present. Nobody knows if they are safe to delete.
  • Copy-paste code. The same preprocessing logic duplicated in training, evaluation, and serving.
  • Hardcoded values. Magic numbers (threshold = 0.42) scattered across the codebase with no explanation or centralized configuration.
  • Monolithic scripts. A 3,000-line train.py that loads data, engineers features, trains the model, evaluates it, serializes it, and deploys it, all in one file.
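The magic-number form of debt is usually the cheapest to pay down: give the number a name, a home, and a comment. A minimal sketch; the threshold value, its rationale, and the function name are all hypothetical:

```python
# BAD: a magic number scattered through the codebase with no explanation
#   if churn_probability > 0.42:
#       flag_for_retention(user)

# BETTER: defined once, named, and documented. (The value and its
# rationale here are hypothetical.)
CHURN_ALERT_THRESHOLD = 0.42  # e.g. tuned on a held-out validation set


def should_flag_for_retention(
    churn_probability: float,
    threshold: float = CHURN_ALERT_THRESHOLD,
) -> bool:
    """Return True when churn risk is high enough to warrant outreach."""
    return churn_probability > threshold
```

The next step up from a named constant is the centralized YAML configuration covered later in this chapter.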

Data-Level Debt

The forms specific to ML:

  • Undocumented schema changes. An upstream team renames a column, changes a data type, or modifies a value encoding. Your pipeline silently produces wrong features.
  • Feature leakage baked into pipelines. A feature that uses future information was added during experimentation and never removed.
  • Stale data dependencies. Your model depends on a table updated by another team. That team stopped updating the table six months ago. Nobody noticed.
  • Test data contamination. The train/test split logic was wrong, but the model appeared to perform well, so nobody investigated.
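Several of these failure modes can be caught at the pipeline boundary instead of being discovered downstream. A sketch of a schema guard that might live in src/data/validate.py; the expected schema shown is illustrative:

```python
import pandas as pd

# Hypothetical expected schema for the raw events table; in the project this
# guard might live in src/data/validate.py and run before feature building.
EXPECTED_SCHEMA = {
    "user_id": "int64",
    "timestamp": "datetime64[ns]",
    "event_type": "object",
}


def validate_schema(df: pd.DataFrame, expected: dict[str, str]) -> None:
    """Fail loudly if an upstream schema change would corrupt features."""
    missing = set(expected) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    for col, dtype in expected.items():
        actual = str(df[col].dtype)
        if actual != dtype:
            raise TypeError(f"Column {col!r}: expected {dtype}, got {actual}")
```

Called at the top of the ingestion step, a renamed or retyped upstream column becomes a loud pipeline failure instead of quietly wrong features.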

Configuration Debt

The forms you did not see coming:

  • Entangled hyperparameters. Changing one hyperparameter silently invalidates the optimal values of others. The learning rate depends on the batch size, which depends on the number of features, which depends on the feature selection threshold.
  • Configuration scattered across files. Feature lists in one YAML, hyperparameters in another, data paths in environment variables, and thresholds in the code itself. No single file shows the complete configuration of the system.
  • Invisible coupling. Model A's predictions are features for Model B. Retraining Model A changes Model B's inputs, but nobody retrained Model B.

Measuring and Managing Technical Debt

You cannot eliminate technical debt. You manage it. Three practices:

1. Track it explicitly. Maintain a TECH_DEBT.md file or a label in your issue tracker. When you take a shortcut, document it. "TODO" comments do not count --- they are invisible to project management.

# Technical Debt Register

## High Priority
- [ ] Feature engineering duplicated between train.py and predict.py (training-serving skew risk)
- [ ] No integration tests for the data pipeline
- [ ] Model config hardcoded in train.py instead of YAML

## Medium Priority
- [ ] `compute_rfm_features()` has 6 boolean parameters --- refactor to strategy pattern
- [ ] Unused features from v1 model still computed in pipeline

## Low Priority
- [ ] Type hints missing from src/evaluation/
- [ ] Test coverage below 60%

2. Allocate time for debt reduction. A common pattern is the "20% rule": dedicate one day per sprint (or one sprint per quarter) to debt reduction. Without explicit allocation, debt reduction never happens, because there is always a more urgent feature to build.

3. Prevent debt at the boundary. Pre-commit hooks, code review checklists, and CI/CD pipelines catch debt before it enters the codebase. It is easier to prevent a shortcut than to fix it later.

# Example: Configuration management that prevents config debt

# config/model_config.yaml
model:
  type: "gradient_boosting"
  hyperparameters:
    n_estimators: 200
    max_depth: 5
    learning_rate: 0.05
    min_samples_leaf: 20
    subsample: 0.8
  features:
    include:
      - days_since_last_event
      - event_count_30d
      - video_completion_rate
      - plan_type_encoded
      - tenure_months
      - monthly_revenue
    exclude:
      - user_id
      - signup_date
  thresholds:
    churn_probability: 0.42
    min_confidence: 0.6

# src/config/loader.py

import yaml
from pathlib import Path
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Centralized model configuration. Single source of truth."""

    model_type: str
    n_estimators: int
    max_depth: int
    learning_rate: float
    min_samples_leaf: int
    subsample: float
    include_features: list[str]
    exclude_features: list[str]
    churn_threshold: float
    min_confidence: float

    @classmethod
    def from_yaml(cls, path: str | Path) -> "ModelConfig":
        """Load configuration from a YAML file."""
        with open(path) as f:
            raw = yaml.safe_load(f)

        model = raw["model"]
        hp = model["hyperparameters"]
        features = model["features"]
        thresholds = model["thresholds"]

        return cls(
            model_type=model["type"],
            n_estimators=hp["n_estimators"],
            max_depth=hp["max_depth"],
            learning_rate=hp["learning_rate"],
            min_samples_leaf=hp["min_samples_leaf"],
            subsample=hp["subsample"],
            include_features=features["include"],
            exclude_features=features["exclude"],
            churn_threshold=thresholds["churn_probability"],
            min_confidence=thresholds["min_confidence"],
        )

Now every script loads from the same YAML file:

config = ModelConfig.from_yaml("config/model_config.yaml")
model = GradientBoostingClassifier(
    n_estimators=config.n_estimators,
    max_depth=config.max_depth,
    learning_rate=config.learning_rate,
)

No magic numbers. No duplicated configuration. One file to update when hyperparameters change.


Putting It All Together: The Refactored StreamFlow Pipeline

Here is the complete structure of the StreamFlow churn project after refactoring:

streamflow-churn/
    pyproject.toml
    .pre-commit-config.yaml
    .gitignore
    Makefile
    config/
        model_config.yaml
    data/
        raw/
        interim/
        processed/
    models/
    notebooks/
        01-eda.ipynb
        02-feature-exploration.ipynb
    src/
        __init__.py
        config/
            __init__.py
            loader.py
        data/
            __init__.py
            make_dataset.py
            validate.py
        features/
            __init__.py
            build_features.py
            recency.py
            frequency.py
            monetary.py
        models/
            __init__.py
            train_model.py
            predict_model.py
        evaluation/
            __init__.py
            evaluate.py
            metrics.py
    tests/
        __init__.py
        conftest.py
        test_data.py
        test_features.py
        test_models.py
        test_pipeline.py
    TECH_DEBT.md

The Makefile ties it together (note that recipe lines must be indented with a real tab character, not spaces):

.PHONY: data features train predict evaluate test lint format all clean

data:
    python -m src.data.make_dataset

features:
    python -m src.features.build_features

train:
    python -m src.models.train_model

predict:
    python -m src.models.predict_model

evaluate:
    python -m src.evaluation.evaluate

test:
    pytest tests/ -v --tb=short

lint:
    ruff check src/ tests/
    mypy src/ --ignore-missing-imports

format:
    black src/ tests/
    ruff check --fix src/ tests/

all: data features train evaluate

clean:
    rm -rf data/interim/* data/processed/* models/*

# Reproduce the entire pipeline
make clean && make all

# Run the test suite
make test

# Format and lint before committing
make format && make lint

Recurring Theme: Reproducibility --- The refactored project satisfies the reproducibility test: clone the repo, install dependencies, run make all, and get the same results. No notebooks to run in the right order. No manual steps. No "you need to ask Sarah for the data." The code is the documentation, and the Makefile is the entry point.


Chapter Summary

This chapter covered the five layers of professional data science engineering:

  1. Project structure (cookiecutter-data-science) separates data, code, notebooks, and artifacts into a navigable, reproducible layout.
  2. Version control (git branching, .gitignore, nbstripout) manages code changes while keeping large files and notebook outputs out of the repository.
  3. Testing (pytest, fixtures, parametrize, integration tests) catches bugs before they reach production and gives you confidence to change code.
  4. Code quality (black, ruff, mypy, pre-commit) enforces consistency, catches errors, and automates the boring parts of code review.
  5. Technical debt management (debt registers, configuration management, DRY) prevents the slow decay that makes ML systems unmaintainable.

The transition from "notebook that works on my laptop" to "package that works in production" is not glamorous. It does not improve your AUC. It does not produce a chart for the stakeholder meeting. But it is the difference between a data science prototype and a data science product. The notebook graveyard exists because nobody made this transition. You will.


Next chapter: Chapter 30: ML Experiment Tracking --- tracking every experiment so you can reproduce and compare every result.