Case Study 2: The Technical Debt Crisis --- An ML System Nobody Can Maintain
Background
NovaTech is a mid-size e-commerce company with $180 million in annual revenue and 6 million active customers. Two years ago, the company's first data scientist --- we will call her Priya --- built a product recommendation engine. The system worked. Revenue from recommendations grew from $2.1 million to $8.7 million per quarter. Priya was promoted. Then she left.
Eight months later, the recommendation system is the company's most valuable ML system and its most fragile. Nobody on the current three-person data science team wrote the original code. Nobody fully understands it. And three weeks ago, recommendation quality started degrading. Click-through rates dropped 23% over two weeks. Revenue from recommendations fell from $8.7 million to $6.4 million. The VP of Product wants answers.
The data science team lead, Marcus, is asked to diagnose the problem and fix it. He opens the codebase and finds a system that is a textbook case of accumulated technical debt. This case study documents what he found, how he prioritized the debt, and the 8-week remediation plan his team executed.
The Audit: What Marcus Found
The Codebase
```text
recommendation-engine/
    train_v1.py                  # 2,847 lines
    train_v2_new.py              # 3,102 lines (unclear relation to v1)
    train_v2_new_FIXED.py        # 3,108 lines
    predict.py                   # 891 lines
    predict_fast.py              # 743 lines (unclear when to use this vs predict.py)
    utils.py                     # 1,204 lines (everything that didn't fit elsewhere)
    config.py                    # 47 lines (last modified 18 months ago)
    features.py                  # 612 lines
    features_v2.py               # 589 lines
    model.pkl                    # 1.2 GB (committed to git)
    data/
        products.csv             # 340 MB (committed to git)
        user_interactions.csv    # 2.1 GB (committed to git --- broke git clone)
    notebooks/
        exploration.ipynb        # 412 cells
        exploration_copy.ipynb
        test_something.ipynb
    requirements.txt             # Last updated 14 months ago
    README.md                    # "TODO: write documentation"
```
Marcus documented 14 distinct technical debt items:
Debt Item 1: Which Training Script Is Production?
Three training scripts exist: train_v1.py, train_v2_new.py, and train_v2_new_FIXED.py. The production cron job runs train_v2_new.py. But train_v2_new_FIXED.py is the most recently modified file. Nobody documented which script should be used or what changed between versions.
```python
# From train_v2_new.py, line 287:
# TODO: Priya - check if this is right, the v1 version used cosine similarity
# but I think dot product is better for this data. Ask Marcus about this.
```
(Note: Marcus was not on the team when this was written.)
Category: Code-level debt. Risk: If the production script is not the correct one, the model has been training with a potentially suboptimal or buggy configuration for months.
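The unresolved TODO matters because cosine similarity and dot product can rank neighbors differently whenever vector norms vary. A small illustration with made-up vectors (not NovaTech's data):

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: the dot product of the normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


query = [1.0, 0.0]
a = [10.0, 10.0]  # large norm, 45 degrees away from the query
b = [0.9, 0.1]    # small norm, nearly parallel to the query

assert dot(query, a) > dot(query, b)        # dot product prefers a
assert cosine(query, b) > cosine(query, a)  # cosine prefers b
```

So the answer to Priya's question is not cosmetic: switching metrics changes which products get recommended, which is why the eventual config file (Weeks 5--6) documents the choice explicitly.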
Debt Item 2: Feature Engineering Divergence
Two feature files exist. The training script imports from features_v2.py. The prediction script imports from features.py. They compute different features.
```python
# features.py (used by predict.py)
import pandas as pd

def compute_user_features(user_df):
    features = {}
    features['avg_purchase_value'] = user_df['purchase_amount'].mean()
    features['purchase_frequency'] = len(user_df) / 30  # hardcoded 30 days
    features['days_since_last_purchase'] = (
        pd.Timestamp.now() - user_df['purchase_date'].max()
    ).days
    return features
```

```python
# features_v2.py (used by train_v2_new.py)
def compute_user_features(user_df, reference_date):
    features = {}
    features['avg_purchase_value'] = user_df['purchase_amount'].mean()
    features['median_purchase_value'] = user_df['purchase_amount'].median()  # NEW
    features['purchase_frequency'] = len(user_df) / 90  # CHANGED from 30 to 90
    features['days_since_last_purchase'] = (
        reference_date - user_df['purchase_date'].max()
    ).days
    features['category_diversity'] = user_df['category'].nunique()  # NEW
    return features
```
Differences found:
| Feature | `features.py` (serving) | `features_v2.py` (training) |
|---|---|---|
| `purchase_frequency` | `count / 30` | `count / 90` (3x smaller) |
| `median_purchase_value` | Not computed | Computed |
| `category_diversity` | Not computed | Computed |
| `days_since_last_purchase` | Uses `Timestamp.now()` | Uses `reference_date` parameter |
Category: Data-level debt + code-level debt.
Risk: This is active training-serving skew. The model was trained on features that differ from the features it receives in production. Because serving divides by 30 where training divides by 90, the purchase_frequency values are 3x larger in serving than in training. The model also expects median_purchase_value and category_diversity at inference time, but the serving pipeline does not compute them, meaning either the model ignores them (unlikely) or the serving code silently fills them with defaults.
This is almost certainly the cause of the 23% CTR drop. When Priya added new features to v2 training but did not update the serving pipeline, the model's predictions became inconsistent with its training data.
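The size of the skew is easy to demonstrate with a toy calculation (hypothetical numbers, not NovaTech's data):

```python
# A hypothetical user with 6 purchases in the observation window.
n_purchases = 6

training_value = n_purchases / 90  # features_v2.py (training pipeline)
serving_value = n_purchases / 30   # features.py (serving pipeline)

# The model learned weights against values near 0.067 but is scored
# against values near 0.200 at inference time: a constant 3x inflation
# of one input feature, for every user, on every request.
assert abs(serving_value - 3 * training_value) < 1e-12
```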
Debt Item 3: The 2.1 GB CSV in Git
The user interactions dataset was committed directly to the git repository. This means:
- `git clone` downloads 2.1 GB before the codebase is usable
- The file is in git history permanently, even if deleted
- The repository is 2.4 GB total (the data is 87% of it)
- Git operations (status, diff, log) are noticeably slow
Category: Code-level debt. Risk: Slow onboarding, difficulty collaborating, and the risk that someone runs the model on stale data because the CSV is 8 months old.
Debt Item 4: The 1.2 GB Pickle File
The trained model is serialized as a pickle file and committed to git. Pickle files are binary, opaque, and version-sensitive: a model pickled with scikit-learn 1.2 may not unpickle with scikit-learn 1.5.
Category: Code-level debt. Risk: The model becomes unloadable after a dependency upgrade. Nobody knows which scikit-learn version produced the current model.
Debt Item 5: Stale Dependencies
```text
# requirements.txt (last updated 14 months ago)
pandas==1.5.3
scikit-learn==1.2.2
numpy==1.24.3
flask==2.3.2
```
The team's development environments have newer versions. The production server was rebuilt last month and installed the latest versions. Nobody tested the code against the pinned versions.
Category: Configuration debt. Risk: Silent behavioral changes between library versions. Scikit-learn's default parameters for some estimators changed between 1.2 and 1.5. If the model was trained on 1.2 defaults but serves on 1.5 defaults, predictions differ.
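A lightweight guard against this kind of drift is to compare the pinned versions against what is actually installed. A sketch (function names are illustrative, not part of NovaTech's codebase):

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse 'name==version' pins from the body of a requirements.txt."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()  # drop comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def installed_versions(names):
    """Look up installed versions; None for packages that are absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_drift(pins, installed):
    """Return {package: (pinned, installed)} for every disagreement."""
    return {
        name: (pinned, installed.get(name))
        for name, pinned in pins.items()
        if installed.get(name) != pinned
    }
```

Run at service start-up (`find_drift(pins, installed_versions(pins))`), a non-empty result can be logged or turned into a hard failure before any predictions are served.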
Debt Item 6: Hardcoded Hyperparameters
```python
# train_v2_new.py, line 312
model = NearestNeighbors(
    n_neighbors=47,    # Why 47? No comment, no config file.
    metric='cosine',   # Or is it dot product? See TODO on line 287.
    algorithm='brute',
)

# train_v2_new.py, line 458
SIMILARITY_THRESHOLD = 0.237  # Magic number. No explanation.

# predict.py, line 89
TOP_K = 12  # Why 12? Different from the 10 used in train_v2_new.py.
```
Category: Configuration debt.
Risk: Nobody knows the rationale for these values. Changing one requires reading thousands of lines to understand its impact. The discrepancy between TOP_K = 12 in prediction and TOP_K = 10 in training suggests a copy-paste error.
Debt Item 7: No Tests
Zero tests. No unit tests, no integration tests, no smoke tests. The only validation is running the training script and checking the output manually.
Category: Code-level debt. Risk: Any code change could break the system silently. The feature divergence (Debt Item 2) would have been caught by a test comparing training and serving feature schemas.
Debt Items 8--14: Additional Issues
| # | Item | Category | Risk |
|---|---|---|---|
| 8 | No logging (uses print statements) | Code | Cannot diagnose production issues |
| 9 | `utils.py` is 1,204 lines of unrelated functions | Code | Impossible to navigate or test |
| 10 | No data validation (schema, nulls, ranges) | Data | Upstream changes break pipeline silently |
| 11 | Model retrained monthly by cron, no monitoring | Data/Config | Model decay goes undetected |
| 12 | No experiment tracking | Config | Cannot reproduce or compare results |
| 13 | README says "TODO" | Code | New team members cannot onboard |
| 14 | `exploration.ipynb` has 412 cells with production logic | Code | Critical logic trapped in notebook |
The Triage: Prioritizing Debt
Marcus categorized each item by severity and urgency:
| Priority | Debt Item | Rationale |
|---|---|---|
| P0 (fix now) | #2: Feature divergence | Actively causing the CTR drop |
| P0 (fix now) | #1: Which training script? | Must know what is running in prod |
| P1 (this sprint) | #7: No tests | Cannot safely fix anything without tests |
| P1 (this sprint) | #5: Stale dependencies | Blocks environment reproducibility |
| P2 (next sprint) | #6: Hardcoded hyperparameters | Slows iteration but not actively broken |
| P2 (next sprint) | #10: No data validation | Prevents future silent failures |
| P2 (next sprint) | #3, #4: Large files in git | Painful but not breaking |
| P3 (backlog) | #8, #9, #11, #12, #13, #14 | Important but non-urgent |
Key Principle --- Triage is not about fixing everything. It is about fixing the things that are actively losing the company money (P0), then building the safety nets that prevent future breakage (P1), then cleaning up the rest incrementally.
The Fix: Week by Week
Week 1: Stop the Bleeding
Goal: Fix the feature divergence and restore CTR.
```python
# Step 1: Determine which features the production model actually expects.
import pickle

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# NearestNeighbors doesn't have feature_names_in_, but we can check
# the training data dimensions.
print(f"Model expects {model.n_features_in_} features")
# Output: Model expects 7 features

# Step 2: Map the 7 features to their sources,
# by reading train_v2_new.py line by line:
TRAINING_FEATURES = [
    "avg_purchase_value",
    "median_purchase_value",     # Missing from predict.py
    "purchase_frequency",        # Different calculation in predict.py
    "days_since_last_purchase",
    "category_diversity",        # Missing from predict.py
    "avg_session_duration",
    "page_views_per_session",
]

# Step 3: Update predict.py to use features_v2.py
# (Temporary fix --- proper fix is a shared feature module)

# Before (predict.py, line 12):
from features import compute_user_features

# After (predict.py, line 12):
from features_v2 import compute_user_features

# Step 4: Fix the purchase_frequency denominator.
# features_v2.py uses 90 days. Verify this matches training data.
# Read Priya's notebook (exploration.ipynb, cell 287):
#   "Using 90-day window for purchase frequency because avg customer
#    purchase cycle is 45 days, need at least 2 cycles for stability."
# Confirmed: 90 is correct. features.py had the bug (30 days).
```
Result: After deploying the fixed prediction pipeline, CTR recovered to within 3% of baseline within 48 hours. The remaining 3% gap was attributed to genuine model decay (8 months since last retraining with updated features).
Week 2: Establish the Test Safety Net
```python
# tests/test_feature_parity.py
"""Ensure training and serving compute identical features."""
import pytest

from features_v2 import compute_user_features

# Fixtures (trained_model, sample_user_data, reference_date) are defined
# in tests/conftest.py.


class TestFeatureParity:
    """These tests exist because the absence of tests caused a $2.3M revenue drop."""

    def test_feature_count_matches_model(self, trained_model, sample_user_data,
                                         reference_date):
        features = compute_user_features(sample_user_data, reference_date)
        assert len(features) == trained_model.n_features_in_

    def test_feature_names_match(self, sample_user_data, reference_date):
        features = compute_user_features(sample_user_data, reference_date)
        expected = [
            "avg_purchase_value", "median_purchase_value",
            "purchase_frequency", "days_since_last_purchase",
            "category_diversity", "avg_session_duration",
            "page_views_per_session",
        ]
        assert list(features.keys()) == expected

    def test_purchase_frequency_uses_90_day_window(self, sample_user_data,
                                                   reference_date):
        """Regression test: features.py had a bug using a 30-day window."""
        features = compute_user_features(sample_user_data, reference_date)
        n_purchases = len(sample_user_data)
        assert features["purchase_frequency"] == pytest.approx(n_purchases / 90)
```

```text
$ pytest tests/ -v
tests/test_feature_parity.py::TestFeatureParity::test_feature_count_matches_model PASSED
tests/test_feature_parity.py::TestFeatureParity::test_feature_names_match PASSED
tests/test_feature_parity.py::TestFeatureParity::test_purchase_frequency_uses_90_day_window PASSED
```
Weeks 3--4: Restructure the Codebase
Marcus led the team through a full restructuring following cookiecutter conventions:
```text
# Before: 14 files, no structure
recommendation-engine/
    train_v1.py
    train_v2_new.py
    train_v2_new_FIXED.py
    predict.py
    predict_fast.py
    utils.py
    config.py
    features.py
    features_v2.py
    model.pkl
    data/...
    notebooks/...
    requirements.txt
    README.md
```

```text
# After: 28 files, clear structure
recommendation-engine/
    pyproject.toml
    Makefile
    .pre-commit-config.yaml
    .gitignore
    config/
        model_config.yaml
        feature_config.yaml
    data/
        raw/.gitkeep             # Data fetched from S3, not committed
        processed/.gitkeep
    models/.gitkeep              # Models stored in S3, not committed
    notebooks/
        01-exploration.ipynb     # Cleaned, imports from src/
    src/
        __init__.py
        data/
            __init__.py
            fetch_data.py
            validate.py
        features/
            __init__.py
            user_features.py     # Single source of truth (was features_v2.py)
            product_features.py
        models/
            __init__.py
            train.py             # Single training script (was 3 scripts)
            predict.py
            evaluate.py
        config/
            __init__.py
            loader.py
    tests/
        __init__.py
        conftest.py
        test_features.py
        test_model.py
        test_pipeline.py
        test_data_validation.py
    TECH_DEBT.md
```
Key decisions:
- `train_v1.py` was archived to a git tag, then deleted. It was the original version and is no longer needed.
- `train_v2_new_FIXED.py` was diff'd against `train_v2_new.py`: the fix was a 3-line change to handle null values. The fix was applied to the canonical `src/models/train.py`.
- `predict_fast.py` was diff'd against `predict.py`: it was an optimization that batched predictions instead of running them one at a time. The batched version became the canonical `src/models/predict.py`.
- `utils.py` was split into 4 modules based on functionality.
- The 2.1 GB CSV and 1.2 GB pickle were removed from git history using `git filter-branch` (after backing up). Data and models now live in S3, fetched by `src/data/fetch_data.py`.
Weeks 5--6: Configuration and Validation
```yaml
# config/model_config.yaml
model:
  type: nearest_neighbors
  hyperparameters:
    n_neighbors: 47        # Tuned by Priya, 2024-Q1 (see experiment log #23)
    metric: cosine         # Cosine chosen over dot product (A/B test, 2024-02)
    algorithm: brute       # Brute force required for cosine metric
  serving:
    top_k: 10              # Standardized (was 10 in training, 12 in serving)
    similarity_threshold: 0.237  # Min score to include in recommendations

features:
  user:
    - avg_purchase_value
    - median_purchase_value
    - purchase_frequency   # 90-day window, NOT 30-day
    - days_since_last_purchase
    - category_diversity
    - avg_session_duration
    - page_views_per_session
  observation_window_days: 90
```
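The restructured tree includes `src/config/loader.py`, whose contents the case study does not show. A minimal sketch of what a typed loader might look like, assuming the YAML has already been parsed into a dict (e.g. via `yaml.safe_load`); the class and function names are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    top_k: int
    similarity_threshold: float


def load_serving_config(cfg: dict) -> ServingConfig:
    """Extract and type-check the serving section of the parsed config.

    `cfg` is the dict produced by parsing model_config.yaml. Taking a dict
    rather than a file path keeps this function dependency-free and easy
    to unit-test.
    """
    serving = cfg["model"]["serving"]
    return ServingConfig(
        top_k=int(serving["top_k"]),
        similarity_threshold=float(serving["similarity_threshold"]),
    )
```

A frozen dataclass means a typo like `config.topk` fails loudly instead of silently reading a default, which is exactly the failure mode the magic numbers in Debt Item 6 invited.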
```python
# src/data/validate.py
"""Data validation for the recommendation pipeline."""
import logging

import pandas as pd

logger = logging.getLogger(__name__)


class DataValidationError(Exception):
    """Raised when input data fails validation checks."""


def validate_interactions(df: pd.DataFrame) -> None:
    """Validate the user interactions DataFrame.

    Checks:
    - Required columns exist
    - No null user_ids or product_ids
    - purchase_amount is non-negative
    - purchase_date is a valid datetime
    - Duplicate (user_id, product_id, purchase_date) rows (warning only)

    Raises:
        DataValidationError: If any hard check fails.
    """
    required = ["user_id", "product_id", "purchase_date", "purchase_amount",
                "category", "session_duration", "page_views"]
    missing = set(required) - set(df.columns)
    if missing:
        raise DataValidationError(f"Missing columns: {missing}")

    for col in ("user_id", "product_id"):
        n_null = df[col].isnull().sum()
        if n_null > 0:
            raise DataValidationError(f"{n_null} rows have null {col}")

    negative_amounts = (df["purchase_amount"] < 0).sum()
    if negative_amounts > 0:
        raise DataValidationError(
            f"{negative_amounts} rows have negative purchase_amount"
        )

    if not pd.api.types.is_datetime64_any_dtype(df["purchase_date"]):
        raise DataValidationError("purchase_date must be a datetime column")

    n_before = len(df)
    n_after = df.drop_duplicates(
        subset=["user_id", "product_id", "purchase_date"]
    ).shape[0]
    if n_before != n_after:
        logger.warning(
            f"{n_before - n_after} duplicate rows detected. "
            "Deduplication required before training."
        )

    logger.info(f"Validation passed: {len(df)} rows, {df['user_id'].nunique()} users")
```
Weeks 7--8: Monitoring and Documentation
Marcus added basic monitoring (detailed monitoring is covered in Chapter 32) and wrote the documentation that should have existed from day one:
```python
# src/models/evaluate.py (excerpt)
def compute_recommendation_metrics(
    recommendations: pd.DataFrame,
    ground_truth: pd.DataFrame,
    k: int = 10,
) -> dict[str, float]:
    """Compute recommendation quality metrics.

    Args:
        recommendations: DataFrame with user_id and recommended product_ids.
        ground_truth: DataFrame with user_id and actually purchased product_ids.
        k: Number of top recommendations to evaluate.

    Returns:
        Dictionary with precision@k, recall@k, ndcg@k, and coverage.
    """
    # ... implementation ...
    return {
        "precision_at_k": precision,
        "recall_at_k": recall,
        "ndcg_at_k": ndcg,
        "catalog_coverage": coverage,
    }
```
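The implementation is elided in the excerpt above. As an illustration of what one of these metrics involves, precision@k for a single user can be computed like this (a sketch, not the chapter's actual code):

```python
def precision_at_k(recommended, purchased, k=10):
    """Fraction of the top-k recommended items the user actually purchased.

    recommended: ranked list of product_ids shown to one user.
    purchased: product_ids the user actually bought in the eval window.
    """
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    purchased_set = set(purchased)
    hits = sum(1 for item in top_k if item in purchased_set)
    return hits / len(top_k)
```

Averaging this over all users yields the `precision_at_k` entry in the metrics dictionary; tracked over time, a sudden drop in it is exactly the signal that would have flagged the feature divergence months earlier.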
Lessons Learned
The Cost of Inaction
Marcus estimated the financial impact of the technical debt:
| Item | Cost |
|---|---|
| Revenue lost during CTR drop (3 weeks at -23%) | $1.5M |
| Engineering time to diagnose (3 people, 1 week) | $45K |
| 8-week remediation effort (3 people, partial time) | $180K |
| Opportunity cost (projects delayed during remediation) | ~$200K |
| Total estimated cost | ~$1.9M |
The feature divergence alone --- a bug that a single integration test would have caught --- cost the company $1.5 million in lost revenue.
The Five Warning Signs
Marcus distilled his experience into five warning signs that technical debt in an ML system has reached critical levels:
1. Nobody knows which code is running in production. Multiple script versions, unclear deployment history, no CI/CD pipeline that forces a single path to production.
2. Feature engineering is duplicated between training and serving. This is the single most dangerous form of ML technical debt. It will cause training-serving skew. It is not a matter of if, but when.
3. The original author has left and nobody can explain the system. If the system's knowledge lives in one person's head, it does not live in the system. Documentation, tests, and clear code are the only remedies.
4. There are zero tests. Not "we have a few tests." Zero. This means every code change is a gamble, and nobody changes code unless absolutely forced to, which means bugs accumulate.
5. Magic numbers have no comments. If a threshold, hyperparameter, or constant appears in the code without explanation, it is a ticking time bomb. When it needs to change, nobody will know what it does, what depends on it, or what valid values look like.
The Prevention Framework
After the remediation, Marcus's team adopted three practices to prevent debt reaccumulation:
1. The "feature parity test." Every CI/CD pipeline includes a test that generates features using the training code path and the serving code path, then asserts they produce identical outputs. This test is marked as blocking --- the pipeline cannot deploy if it fails.
2. The "bus factor review." Every quarter, the team identifies systems where only one person understands the code. That person is required to pair-program with another team member, write documentation, and add tests until the bus factor is at least two.
3. The "debt budget." Twenty percent of each sprint is reserved for technical debt reduction. The debt register (TECH_DEBT.md) is reviewed in sprint planning, and items are treated as first-class work items with estimates and acceptance criteria.
Discussion Questions
1. The feature divergence bug (Debt Item 2) existed for approximately 8 months before it was detected. What specific monitoring metric would have caught it earlier? (Hint: think about input feature distributions, not just output metrics.)
2. Marcus's team spent 8 weeks on remediation. A product manager argues this time would have been better spent building new features. How would you calculate the ROI of the remediation to make the business case?
3. Priya, the original author, was promoted and then left. What organizational practices could prevent critical system knowledge from leaving with a single person?
4. The team chose to fix the feature divergence by updating the serving code to match the training code (rather than retraining the model to match the serving code). What are the tradeoffs of each approach? Under what circumstances would the opposite choice be better?
5. NovaTech's recommendation system had zero tests. If you could add only three tests to this system, which three would provide the most protection? Justify your choices.
Case Study 2 accompanies Chapter 29: Software Engineering for Data Scientists. See index.md for the full chapter.