Case Study 2: The Technical Debt Crisis --- An ML System Nobody Can Maintain
Background
NovaTech is a mid-size e-commerce company with $180 million in annual revenue and 6 million active customers. Two years ago, the company's first data scientist --- we will call her Priya --- built a product recommendation engine. The system worked. Revenue from recommendations grew from $2.1 million to $8.7 million per quarter. Priya was promoted. Then she left.
Eight months later, the recommendation system is the company's most valuable ML system and its most fragile. Nobody on the current three-person data science team wrote the original code. Nobody fully understands it. And three weeks ago, recommendation quality started degrading. Click-through rates dropped 23% over two weeks. Revenue from recommendations fell from $8.7 million to $6.4 million. The VP of Product wants answers.
The data science team lead, Marcus, is asked to diagnose the problem and fix it. He opens the codebase and finds a system that is a textbook case of accumulated technical debt. This case study documents what he found, how he prioritized the debt, and the 8-week remediation plan his team executed.
The Audit: What Marcus Found
The Codebase
```text
recommendation-engine/
    train_v1.py                  # 2,847 lines
    train_v2_new.py              # 3,102 lines (unclear relation to v1)
    train_v2_new_FIXED.py        # 3,108 lines
    predict.py                   # 891 lines
    predict_fast.py              # 743 lines (unclear when to use this vs predict.py)
    utils.py                     # 1,204 lines (everything that didn't fit elsewhere)
    config.py                    # 47 lines (last modified 18 months ago)
    features.py                  # 612 lines
    features_v2.py               # 589 lines
    model.pkl                    # 1.2 GB (committed to git)
    data/
        products.csv             # 340 MB (committed to git)
        user_interactions.csv    # 2.1 GB (committed to git --- broke git clone)
    notebooks/
        exploration.ipynb        # 412 cells
        exploration_copy.ipynb
        test_something.ipynb
    requirements.txt             # Last updated 14 months ago
    README.md                    # "TODO: write documentation"
```
Marcus documented 14 distinct technical debt items:
Debt Item 1: Which Training Script Is Production?
Three training scripts exist: train_v1.py, train_v2_new.py, and train_v2_new_FIXED.py. The production cron job runs train_v2_new.py. But train_v2_new_FIXED.py is the most recently modified file. Nobody documented which script should be used or what changed between versions.
```python
# From train_v2_new.py, line 287:
# TODO: Priya - check if this is right, the v1 version used cosine similarity
# but I think dot product is better for this data. Ask Marcus about this.
```
(Note: Marcus was not on the team when this was written.)
Category: Code-level debt. Risk: If the production script is not the correct one, the model has been training with a potentially suboptimal or buggy configuration for months.
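The unresolved TODO matters because cosine similarity and dot product can rank neighbors differently whenever vector norms vary. A small illustration with made-up vectors (not NovaTech's data):

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: the dot product of the normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


query = [1.0, 0.0]
a = [10.0, 10.0]  # large norm, 45 degrees away from the query
b = [0.9, 0.1]    # small norm, nearly parallel to the query

assert dot(query, a) > dot(query, b)        # dot product prefers a
assert cosine(query, b) > cosine(query, a)  # cosine prefers b
```

So the answer to Priya's question is not cosmetic: switching metrics changes which products get recommended, which is why the eventual config file (Weeks 5--6) documents the choice explicitly.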
Debt Item 2: Feature Engineering Divergence
Two feature files exist. The training script imports from features_v2.py. The prediction script imports from features.py. They compute different features.
```python
# features.py (used by predict.py)
import pandas as pd

def compute_user_features(user_df):
    features = {}
    features['avg_purchase_value'] = user_df['purchase_amount'].mean()
    features['purchase_frequency'] = len(user_df) / 30  # hardcoded 30 days
    features['days_since_last_purchase'] = (
        pd.Timestamp.now() - user_df['purchase_date'].max()
    ).days
    return features
```

```python
# features_v2.py (used by train_v2_new.py)
def compute_user_features(user_df, reference_date):
    features = {}
    features['avg_purchase_value'] = user_df['purchase_amount'].mean()
    features['median_purchase_value'] = user_df['purchase_amount'].median()  # NEW
    features['purchase_frequency'] = len(user_df) / 90  # CHANGED from 30 to 90
    features['days_since_last_purchase'] = (
        reference_date - user_df['purchase_date'].max()
    ).days
    features['category_diversity'] = user_df['category'].nunique()  # NEW
    return features
```
Differences found:
| Feature | `features.py` (serving) | `features_v2.py` (training) |
|---|---|---|
| `purchase_frequency` | `count / 30` | `count / 90` (3x smaller) |
| `median_purchase_value` | Not computed | Computed |
| `category_diversity` | Not computed | Computed |
| `days_since_last_purchase` | Uses `Timestamp.now()` | Uses `reference_date` parameter |
Category: Data-level debt + code-level debt.
Risk: This is active training-serving skew. The model was trained on features that differ from the features it receives in production. Because serving divides by 30 where training divides by 90, the purchase_frequency values are 3x larger in serving than in training. The model also expects median_purchase_value and category_diversity at inference time, but the serving pipeline does not compute them, meaning either the model ignores them (unlikely) or the serving code silently fills them with defaults.
This is almost certainly the cause of the 23% CTR drop. When Priya added new features to v2 training but did not update the serving pipeline, the model's predictions became inconsistent with its training data.
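The size of the skew is easy to demonstrate with a toy calculation (hypothetical numbers, not NovaTech's data):

```python
# A hypothetical user with 6 purchases in the observation window.
n_purchases = 6

training_value = n_purchases / 90  # features_v2.py (training pipeline)
serving_value = n_purchases / 30   # features.py (serving pipeline)

# The model learned weights against values near 0.067 but is scored
# against values near 0.200 at inference time: a constant 3x inflation
# of one input feature, for every user, on every request.
assert abs(serving_value - 3 * training_value) < 1e-12
```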
Debt Item 3: The 2.1 GB CSV in Git
The user interactions dataset was committed directly to the git repository. This means:
- `git clone` downloads 2.1 GB before the codebase is usable
- The file is in git history permanently, even if deleted
- The repository is 2.4 GB total (the data is 87% of it)
- Git operations (status, diff, log) are noticeably slow
Category: Code-level debt. Risk: Slow onboarding, difficulty collaborating, and the risk that someone runs the model on stale data because the CSV is 8 months old.
Debt Item 4: The 1.2 GB Pickle File
The trained model is serialized as a pickle file and committed to git. Pickle files are binary, opaque, and version-sensitive: a model pickled with scikit-learn 1.2 may not unpickle with scikit-learn 1.5.
Category: Code-level debt. Risk: The model becomes unloadable after a dependency upgrade. Nobody knows which scikit-learn version produced the current model.
Debt Item 5: Stale Dependencies
```text
# requirements.txt (last updated 14 months ago)
pandas==1.5.3
scikit-learn==1.2.2
numpy==1.24.3
flask==2.3.2
```
The team's development environments have newer versions. The production server was rebuilt last month and installed the latest versions. Nobody tested the code against the pinned versions.
Category: Configuration debt. Risk: Silent behavioral changes between library versions. Scikit-learn's default parameters for some estimators changed between 1.2 and 1.5. If the model was trained on 1.2 defaults but serves on 1.5 defaults, predictions differ.
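A lightweight guard against this kind of drift is to compare the pinned versions against what is actually installed. A sketch (function names are illustrative, not part of NovaTech's codebase):

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse 'name==version' pins from the body of a requirements.txt."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()  # drop comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def installed_versions(names):
    """Look up installed versions; None for packages that are absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_drift(pins, installed):
    """Return {package: (pinned, installed)} for every disagreement."""
    return {
        name: (pinned, installed.get(name))
        for name, pinned in pins.items()
        if installed.get(name) != pinned
    }
```

Run at service start-up (`find_drift(pins, installed_versions(pins))`), a non-empty result can be logged or turned into a hard failure before any predictions are served.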
Debt Item 6: Hardcoded Hyperparameters
```python
# train_v2_new.py, line 312
model = NearestNeighbors(
    n_neighbors=47,    # Why 47? No comment, no config file.
    metric='cosine',   # Or is it dot product? See TODO on line 287.
    algorithm='brute',
)

# train_v2_new.py, line 458
SIMILARITY_THRESHOLD = 0.237  # Magic number. No explanation.

# predict.py, line 89
TOP_K = 12  # Why 12? Different from the 10 used in train_v2_new.py.
```
Category: Configuration debt.
Risk: Nobody knows the rationale for these values. Changing one requires reading thousands of lines to understand its impact. The discrepancy between TOP_K = 12 in prediction and TOP_K = 10 in training suggests a copy-paste error.
Debt Item 7: No Tests
Zero tests. No unit tests, no integration tests, no smoke tests. The only validation is running the training script and checking the output manually.
Category: Code-level debt. Risk: Any code change could break the system silently. The feature divergence (Debt Item 2) would have been caught by a test comparing training and serving feature schemas.
Debt Items 8--14: Additional Issues
| # | Item | Category | Risk |
|---|---|---|---|
| 8 | No logging (uses print statements) | Code | Cannot diagnose production issues |
| 9 | `utils.py` is 1,204 lines of unrelated functions | Code | Impossible to navigate or test |
| 10 | No data validation (schema, nulls, ranges) | Data | Upstream changes break pipeline silently |
| 11 | Model retrained monthly by cron, no monitoring | Data/Config | Model decay goes undetected |
| 12 | No experiment tracking | Config | Cannot reproduce or compare results |
| 13 | README says "TODO" | Code | New team members cannot onboard |
| 14 | `exploration.ipynb` has 412 cells with production logic | Code | Critical logic trapped in notebook |
The Triage: Prioritizing Debt
Marcus categorized each item by severity and urgency:
| Priority | Debt Item | Rationale |
|---|---|---|
| P0 (fix now) | #2: Feature divergence | Actively causing the CTR drop |
| P0 (fix now) | #1: Which training script? | Must know what is running in prod |
| P1 (this sprint) | #7: No tests | Cannot safely fix anything without tests |
| P1 (this sprint) | #5: Stale dependencies | Blocks environment reproducibility |
| P2 (next sprint) | #6: Hardcoded hyperparameters | Slows iteration but not actively broken |
| P2 (next sprint) | #10: No data validation | Prevents future silent failures |
| P2 (next sprint) | #3, #4: Large files in git | Painful but not breaking |
| P3 (backlog) | #8, #9, #11, #12, #13, #14 | Important but non-urgent |
Key Principle --- Triage is not about fixing everything. It is about fixing the things that are actively losing the company money (P0), then building the safety nets that prevent future breakage (P1), then cleaning up the rest incrementally.
The Fix: Week by Week
Week 1: Stop the Bleeding
Goal: Fix the feature divergence and restore CTR.
```python
# Step 1: Determine which features the production model actually expects.
import pickle

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# NearestNeighbors doesn't have feature_names_in_, but we can check
# the training data dimensions.
print(f"Model expects {model.n_features_in_} features")
# Output: Model expects 7 features

# Step 2: Map the 7 features to their sources,
# by reading train_v2_new.py line by line:
TRAINING_FEATURES = [
    "avg_purchase_value",
    "median_purchase_value",     # Missing from predict.py
    "purchase_frequency",        # Different calculation in predict.py
    "days_since_last_purchase",
    "category_diversity",        # Missing from predict.py
    "avg_session_duration",
    "page_views_per_session",
]

# Step 3: Update predict.py to use features_v2.py
# (Temporary fix --- proper fix is a shared feature module)

# Before (predict.py, line 12):
from features import compute_user_features

# After (predict.py, line 12):
from features_v2 import compute_user_features

# Step 4: Fix the purchase_frequency denominator.
# features_v2.py uses 90 days. Verify this matches training data.
# Read Priya's notebook (exploration.ipynb, cell 287):
#   "Using 90-day window for purchase frequency because avg customer
#    purchase cycle is 45 days, need at least 2 cycles for stability."
# Confirmed: 90 is correct. features.py had the bug (30 days).
```
Result: After deploying the fixed prediction pipeline, CTR recovered to within 3% of baseline within 48 hours. The remaining 3% gap was attributed to genuine model decay (8 months since last retraining with updated features).
Week 2: Establish the Test Safety Net
```python
# tests/test_feature_parity.py
"""Ensure training and serving compute identical features."""
import pytest

from features_v2 import compute_user_features

# Fixtures (trained_model, sample_user_data, reference_date) are defined
# in tests/conftest.py.


class TestFeatureParity:
    """These tests exist because the absence of tests caused a $2.3M revenue drop."""

    def test_feature_count_matches_model(self, trained_model, sample_user_data,
                                         reference_date):
        features = compute_user_features(sample_user_data, reference_date)
        assert len(features) == trained_model.n_features_in_

    def test_feature_names_match(self, sample_user_data, reference_date):
        features = compute_user_features(sample_user_data, reference_date)
        expected = [
            "avg_purchase_value", "median_purchase_value",
            "purchase_frequency", "days_since_last_purchase",
            "category_diversity", "avg_session_duration",
            "page_views_per_session",
        ]
        assert list(features.keys()) == expected

    def test_purchase_frequency_uses_90_day_window(self, sample_user_data,
                                                   reference_date):
        """Regression test: features.py had a bug using a 30-day window."""
        features = compute_user_features(sample_user_data, reference_date)
        n_purchases = len(sample_user_data)
        assert features["purchase_frequency"] == pytest.approx(n_purchases / 90)
```

```text
$ pytest tests/ -v
tests/test_feature_parity.py::TestFeatureParity::test_feature_count_matches_model PASSED
tests/test_feature_parity.py::TestFeatureParity::test_feature_names_match PASSED
tests/test_feature_parity.py::TestFeatureParity::test_purchase_frequency_uses_90_day_window PASSED
```
Weeks 3--4: Restructure the Codebase
Marcus led the team through a full restructuring following cookiecutter conventions:
```text
# Before: 14 files, no structure
recommendation-engine/
    train_v1.py
    train_v2_new.py
    train_v2_new_FIXED.py
    predict.py
    predict_fast.py
    utils.py
    config.py
    features.py
    features_v2.py
    model.pkl
    data/...
    notebooks/...
    requirements.txt
    README.md
```

```text
# After: 28 files, clear structure
recommendation-engine/
    pyproject.toml
    Makefile
    .pre-commit-config.yaml
    .gitignore
    config/
        model_config.yaml
        feature_config.yaml
    data/
        raw/.gitkeep             # Data fetched from S3, not committed
        processed/.gitkeep
    models/.gitkeep              # Models stored in S3, not committed
    notebooks/
        01-exploration.ipynb     # Cleaned, imports from src/
    src/
        __init__.py
        data/
            __init__.py
            fetch_data.py
            validate.py
        features/
            __init__.py
            user_features.py     # Single source of truth (was features_v2.py)
            product_features.py
        models/
            __init__.py
            train.py             # Single training script (was 3 scripts)
            predict.py
            evaluate.py
        config/
            __init__.py
            loader.py
    tests/
        __init__.py
        conftest.py
        test_features.py
        test_model.py
        test_pipeline.py
        test_data_validation.py
    TECH_DEBT.md
```
Key decisions:
- `train_v1.py` was archived to a git tag, then deleted. It was the original version and is no longer needed.
- `train_v2_new_FIXED.py` was diff'd against `train_v2_new.py`: the fix was a 3-line change to handle null values. The fix was applied to the canonical `src/models/train.py`.
- `predict_fast.py` was diff'd against `predict.py`: it was an optimization that batched predictions instead of running them one at a time. The batched version became the canonical `src/models/predict.py`.
- `utils.py` was split into 4 modules based on functionality.
- The 2.1 GB CSV and 1.2 GB pickle were removed from git history using `git filter-branch` (after backing up). Data and models now live in S3, fetched by `src/data/fetch_data.py`.
Weeks 5--6: Configuration and Validation
```yaml
# config/model_config.yaml
model:
  type: nearest_neighbors
  hyperparameters:
    n_neighbors: 47        # Tuned by Priya, 2024-Q1 (see experiment log #23)
    metric: cosine         # Cosine chosen over dot product (A/B test, 2024-02)
    algorithm: brute       # Brute force required for cosine metric
  serving:
    top_k: 10              # Standardized (was 10 in training, 12 in serving)
    similarity_threshold: 0.237  # Min score to include in recommendations

features:
  user:
    - avg_purchase_value
    - median_purchase_value
    - purchase_frequency   # 90-day window, NOT 30-day
    - days_since_last_purchase
    - category_diversity
    - avg_session_duration
    - page_views_per_session
  observation_window_days: 90
```
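The restructured tree includes `src/config/loader.py`, whose contents the case study does not show. A minimal sketch of what a typed loader might look like, assuming the YAML has already been parsed into a dict (e.g. via `yaml.safe_load`); the class and function names are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    top_k: int
    similarity_threshold: float


def load_serving_config(cfg: dict) -> ServingConfig:
    """Extract and type-check the serving section of the parsed config.

    `cfg` is the dict produced by parsing model_config.yaml. Taking a dict
    rather than a file path keeps this function dependency-free and easy
    to unit-test.
    """
    serving = cfg["model"]["serving"]
    return ServingConfig(
        top_k=int(serving["top_k"]),
        similarity_threshold=float(serving["similarity_threshold"]),
    )
```

A frozen dataclass means a typo like `config.topk` fails loudly instead of silently reading a default, which is exactly the failure mode the magic numbers in Debt Item 6 invited.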
```python
# src/data/validate.py
"""Data validation for the recommendation pipeline."""
import logging

import pandas as pd

logger = logging.getLogger(__name__)


class DataValidationError(Exception):
    """Raised when input data fails validation checks."""


def validate_interactions(df: pd.DataFrame) -> None:
    """Validate the user interactions DataFrame.

    Checks:
    - Required columns exist
    - No null user_ids or product_ids
    - purchase_amount is non-negative
    - purchase_date is a valid datetime
    - Duplicate (user_id, product_id, purchase_date) rows (warning only)

    Raises:
        DataValidationError: If any hard check fails.
    """
    required = ["user_id", "product_id", "purchase_date", "purchase_amount",
                "category", "session_duration", "page_views"]
    missing = set(required) - set(df.columns)
    if missing:
        raise DataValidationError(f"Missing columns: {missing}")

    for col in ("user_id", "product_id"):
        n_null = df[col].isnull().sum()
        if n_null > 0:
            raise DataValidationError(f"{n_null} rows have null {col}")

    negative_amounts = (df["purchase_amount"] < 0).sum()
    if negative_amounts > 0:
        raise DataValidationError(
            f"{negative_amounts} rows have negative purchase_amount"
        )

    if not pd.api.types.is_datetime64_any_dtype(df["purchase_date"]):
        raise DataValidationError("purchase_date must be a datetime column")

    n_before = len(df)
    n_after = df.drop_duplicates(
        subset=["user_id", "product_id", "purchase_date"]
    ).shape[0]
    if n_before != n_after:
        logger.warning(
            f"{n_before - n_after} duplicate rows detected. "
            "Deduplication required before training."
        )

    logger.info(f"Validation passed: {len(df)} rows, {df['user_id'].nunique()} users")
```
Weeks 7--8: Monitoring and Documentation
Marcus added basic monitoring (detailed monitoring is covered in Chapter 32) and wrote the documentation that should have existed from day one:
```python
# src/models/evaluate.py (excerpt)
def compute_recommendation_metrics(
    recommendations: pd.DataFrame,
    ground_truth: pd.DataFrame,
    k: int = 10,
) -> dict[str, float]:
    """Compute recommendation quality metrics.

    Args:
        recommendations: DataFrame with user_id and recommended product_ids.
        ground_truth: DataFrame with user_id and actually purchased product_ids.
        k: Number of top recommendations to evaluate.

    Returns:
        Dictionary with precision@k, recall@k, ndcg@k, and coverage.
    """
    # ... implementation ...
    return {
        "precision_at_k": precision,
        "recall_at_k": recall,
        "ndcg_at_k": ndcg,
        "catalog_coverage": coverage,
    }
```
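The implementation is elided in the excerpt above. As an illustration of what one of these metrics involves, precision@k for a single user can be computed like this (a sketch, not the chapter's actual code):

```python
def precision_at_k(recommended, purchased, k=10):
    """Fraction of the top-k recommended items the user actually purchased.

    recommended: ranked list of product_ids shown to one user.
    purchased: product_ids the user actually bought in the eval window.
    """
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    purchased_set = set(purchased)
    hits = sum(1 for item in top_k if item in purchased_set)
    return hits / len(top_k)
```

Averaging this over all users yields the `precision_at_k` entry in the metrics dictionary; tracked over time, a sudden drop in it is exactly the signal that would have flagged the feature divergence months earlier.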
Lessons Learned
The Cost of Inaction
Marcus estimated the financial impact of the technical debt:
| Item | Cost |
|---|---|
| Revenue lost during CTR drop (3 weeks at -23%) | $1.5M |
| Engineering time to diagnose (3 people, 1 week) | $45K |
| 8-week remediation effort (3 people, partial time) | $180K |
| Opportunity cost (projects delayed during remediation) | ~$200K |
| Total estimated cost | ~$1.9M |
The feature divergence alone --- a bug that a single integration test would have caught --- cost the company $1.5 million in lost revenue.
The Five Warning Signs
Marcus distilled his experience into five warning signs that technical debt in an ML system has reached critical levels:
1. Nobody knows which code is running in production. Multiple script versions, unclear deployment history, no CI/CD pipeline that forces a single path to production.
2. Feature engineering is duplicated between training and serving. This is the single most dangerous form of ML technical debt. It will cause training-serving skew. It is not a matter of if, but when.
3. The original author has left and nobody can explain the system. If the system's knowledge lives in one person's head, it does not live in the system. Documentation, tests, and clear code are the only remedies.
4. There are zero tests. Not "we have a few tests." Zero. This means every code change is a gamble, and nobody changes code unless absolutely forced to, which means bugs accumulate.
5. Magic numbers have no comments. If a threshold, hyperparameter, or constant appears in the code without explanation, it is a ticking time bomb. When it needs to change, nobody will know what it does, what depends on it, or what valid values look like.
The Prevention Framework
After the remediation, Marcus's team adopted three practices to prevent debt reaccumulation:
1. The "feature parity test." Every CI/CD pipeline includes a test that generates features using the training code path and the serving code path, then asserts they produce identical outputs. This test is marked as blocking --- the pipeline cannot deploy if it fails.
2. The "bus factor review." Every quarter, the team identifies systems where only one person understands the code. That person is required to pair-program with another team member, write documentation, and add tests until the bus factor is at least two.
3. The "debt budget." Twenty percent of each sprint is reserved for technical debt reduction. The debt register (TECH_DEBT.md) is reviewed in sprint planning, and items are treated as first-class work items with estimates and acceptance criteria.
Discussion Questions
1. The feature divergence bug (Debt Item 2) existed for approximately 8 months before it was detected. What specific monitoring metric would have caught it earlier? (Hint: think about input feature distributions, not just output metrics.)
2. Marcus's team spent 8 weeks on remediation. A product manager argues this time would have been better spent building new features. How would you calculate the ROI of the remediation to make the business case?
3. Priya, the original author, was promoted and then left. What organizational practices could prevent critical system knowledge from leaving with a single person?
4. The team chose to fix the feature divergence by updating the serving code to match the training code (rather than retraining the model to match the serving code). What are the tradeoffs of each approach? Under what circumstances would the opposite choice be better?
5. NovaTech's recommendation system had zero tests. If you could add only three tests to this system, which three would provide the most protection? Justify your choices.
Case Study 2 accompanies Chapter 29: Software Engineering for Data Scientists. See index.md for the full chapter.