Exercises: Chapter 29
Software Engineering for Data Scientists
Exercise 1: Project Structure Audit (Conceptual + Code)
You inherit a data science project with the following directory structure:
```text
churn_project/
    data.csv
    churn_model_v1.ipynb
    churn_model_v2.ipynb
    churn_model_FINAL.ipynb
    churn_model_FINAL_fixed.ipynb
    model.pkl
    utils.py
    requirements.txt
    output.csv
    figures/
        roc_curve.png
        feature_importance.png
    old_stuff/
        churn_model_old.ipynb
        data_backup.csv
```
a) List every structural problem with this project. Consider reproducibility, collaboration, version control, data management, and maintainability.
b) Propose a refactored directory structure following cookiecutter-data-science conventions. For each file or directory in the original, state where it should go in the new structure, and explain why.
c) Write a .gitignore file for the refactored project. For each entry, add a comment explaining what it excludes and why.
d) The utils.py file contains 14 functions: 5 for data cleaning, 4 for feature engineering, 3 for visualization, and 2 for model evaluation. How would you split this into multiple modules? Write the __init__.py for each new module, exposing an appropriate public API.
e) A teammate argues: "This restructuring is going to take two days and it doesn't improve the model at all. We should be tuning hyperparameters instead." Write a response (3-5 sentences) explaining why the restructuring is worth the investment.
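As a starting point for part (c), here is a minimal .gitignore sketch for a cookiecutter-style layout. The paths are illustrative and should be tailored to your refactored structure:

```text
# Raw and processed data: large, potentially sensitive, regenerable via the pipeline
data/

# Trained model binaries: track with DVC or a model registry, not git
models/*.pkl

# Python bytecode and notebook checkpoints
__pycache__/
*.py[cod]
.ipynb_checkpoints/

# Virtual environments
.venv/
venv/

# Secrets and local configuration
.env

# Editor and OS noise
.vscode/
.DS_Store
```

Your full answer should comment every entry, as the exercise asks; the comments above illustrate the expected level of detail.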
Exercise 2: Writing Your First Tests (Code)
Below is a feature engineering function extracted from a notebook. Write a complete test file for it.
```python
# src/features/engagement.py
import pandas as pd
import numpy as np


def compute_engagement_features(
    events: pd.DataFrame,
    reference_date: pd.Timestamp,
    window_days: int = 30,
) -> pd.DataFrame:
    """Compute engagement features per user over a rolling window.

    Args:
        events: DataFrame with columns ['user_id', 'timestamp', 'event_type'].
        reference_date: End of the observation window.
        window_days: Number of days to look back from reference_date.

    Returns:
        DataFrame with columns:
            - user_id
            - event_count: total events in window
            - active_days: distinct days with at least one event
            - engagement_rate: active_days / window_days
            - favorite_event_type: most frequent event type
    """
    window_start = reference_date - pd.Timedelta(days=window_days)
    windowed = events[
        (events["timestamp"] >= window_start)
        & (events["timestamp"] < reference_date)
    ].copy()
    if windowed.empty:
        return pd.DataFrame(columns=[
            "user_id", "event_count", "active_days",
            "engagement_rate", "favorite_event_type",
        ])
    windowed["date"] = windowed["timestamp"].dt.date
    event_count = windowed.groupby("user_id").size().reset_index(name="event_count")
    active_days = (
        windowed.groupby("user_id")["date"]
        .nunique()
        .reset_index(name="active_days")
    )
    favorite = (
        windowed.groupby("user_id")["event_type"]
        .agg(lambda x: x.value_counts().index[0])
        .reset_index(name="favorite_event_type")
    )
    result = event_count.merge(active_days, on="user_id")
    result = result.merge(favorite, on="user_id")
    result["engagement_rate"] = result["active_days"] / window_days
    return result
```
a) Write at least 6 unit tests using pytest. Cover:
- Normal case with multiple users and event types
- Empty events DataFrame
- Single user with a single event
- Events entirely outside the window
- User with events on every day of the window
- Custom window size
b) Write a pytest fixture in conftest.py that generates a reusable event DataFrame with at least 4 users, 3 event types, and events spanning 60 days.
c) Use @pytest.mark.parametrize to test the engagement rate calculation with at least 5 different (active_days, window_days, expected_rate) combinations.
d) Write one integration test that calls compute_engagement_features, merges the result with a subscriber table, and verifies that the merged DataFrame has the expected shape and no unexpected nulls.
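To get part (b) started, here is one possible shape for the fixture. The user IDs, event types, seed, and reference date are placeholders; keeping the data construction in a plain builder function next to the fixture makes it callable (and testable) outside pytest as well:

```python
# tests/conftest.py sketch (names and values are illustrative)
import numpy as np
import pandas as pd
import pytest


def make_sample_events(
    reference_date: pd.Timestamp = pd.Timestamp("2024-03-01"),
    n_days: int = 60,
    seed: int = 42,
) -> pd.DataFrame:
    """Build events for 4 users and 3 event types spanning n_days before reference_date."""
    rng = np.random.default_rng(seed)
    rows = []
    for user_id in ["u1", "u2", "u3", "u4"]:
        # Each user gets a random number of events at random times in the window
        for _ in range(int(rng.integers(10, 40))):
            rows.append({
                "user_id": user_id,
                "timestamp": reference_date - pd.Timedelta(days=float(rng.uniform(0, n_days))),
                "event_type": str(rng.choice(["play", "pause", "browse"])),
            })
    return pd.DataFrame(rows)


@pytest.fixture
def sample_events() -> pd.DataFrame:
    return make_sample_events()
```

A fixed seed keeps the fixture deterministic, so tests that depend on it do not flake.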
Exercise 3: The Refactoring Kata (Code)
The following notebook code computes monetary features for the StreamFlow churn model. Refactor it into a proper function in src/features/monetary.py.
```python
# --- Notebook Cell 23 ---
# Compute monetary features
rev_df = events_df[events_df['event_type'] == 'billing_event'].copy()
rev_df['month'] = rev_df['timestamp'].dt.to_period('M')
monthly_rev = rev_df.groupby(['user_id', 'month'])['revenue'].sum().reset_index()
avg_rev = monthly_rev.groupby('user_id')['revenue'].mean().reset_index()
avg_rev.columns = ['user_id', 'avg_monthly_revenue']
max_rev = monthly_rev.groupby('user_id')['revenue'].max().reset_index()
max_rev.columns = ['user_id', 'max_monthly_revenue']
rev_trend = monthly_rev.sort_values('month').groupby('user_id')['revenue'].apply(
    lambda x: np.polyfit(range(len(x)), x, 1)[0] if len(x) > 1 else 0
).reset_index()
rev_trend.columns = ['user_id', 'revenue_trend']
monetary = avg_rev.merge(max_rev, on='user_id').merge(rev_trend, on='user_id')
```
Your refactored function must:
a) Have a clear function signature with type hints and a docstring.
b) Accept events and reference_date as parameters (not rely on global variables).
c) Handle edge cases: users with no billing events, users with exactly one month of data, empty DataFrames.
d) Return a DataFrame with columns ['user_id', 'avg_monthly_revenue', 'max_monthly_revenue', 'revenue_trend'].
e) Include a test that verifies the revenue trend is positive for a user whose monthly revenue increases over time, and negative for a user whose revenue decreases.
f) Replace the lambda inside .apply() with a named function. Explain why named functions are preferred over lambdas in production code.
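For part (f), one possible named replacement for the trend lambda is sketched below. The function name and the zero-return convention for single-month users are choices you may vary:

```python
import numpy as np
import pandas as pd


def revenue_slope(monthly_revenue: pd.Series) -> float:
    """Slope of a least-squares line through monthly revenue in month order.

    Returns 0.0 when there are fewer than two months, since a trend is
    undefined for a single observation.
    """
    if len(monthly_revenue) < 2:
        return 0.0
    x = np.arange(len(monthly_revenue))
    return float(np.polyfit(x, monthly_revenue.to_numpy(), 1)[0])
```

Unlike a lambda, a named function can carry a docstring, be unit-tested in isolation, and appears by name in tracebacks and profiler output, which is the core of the argument part (f) asks you to make.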
Exercise 4: Pre-Commit Configuration (Code + Conceptual)
a) Create a .pre-commit-config.yaml file that includes:
- black (formatting)
- ruff (linting with auto-fix)
- mypy (type checking)
- nbstripout (notebook output stripping)
- A check for files larger than 1 MB
- A check for accidental commit of .env files
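A sketch of the kind of configuration part (a) asks for is below. The `rev` pins are placeholders, not current releases; pin them to the versions your team actually uses (`pre-commit autoupdate` will do this for you):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2          # placeholder; pin to a real release
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4          # placeholder
    hooks:
      - id: ruff
        args: [--fix]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0         # placeholder
    hooks:
      - id: mypy
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1           # placeholder
    hooks:
      - id: nbstripout
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0          # placeholder
    hooks:
      - id: check-added-large-files
        args: [--maxkb=1024]   # 1 MB
  - repo: local
    hooks:
      - id: forbid-dotenv
        name: block committed .env files
        entry: ".env files must never be committed"
        language: fail        # pre-commit's built-in always-fail hook language
        files: '(^|/)\.env(\..*)?$'
```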
b) A teammate runs git commit and the pre-commit hooks reject the commit with the following output:
```text
black....................................................................Failed
- hook id: black
- files were modified by this hook

ruff.....................................................................Failed
- hook id: ruff
- exit code: 1

src/features/build_features.py:47:5: F841 Local variable 'temp_df' is assigned to but never used

mypy.....................................................................Failed
- hook id: mypy

src/models/train_model.py:23: error: Argument 1 to "fit" has incompatible type "DataFrame"; expected "ndarray[Any, dtype[floating[Any]]]"
```
For each failure, explain: (1) what the tool detected, (2) how to fix it, and (3) what would have happened if this code had been committed without the hook.
c) Your team lead says: "Pre-commit hooks slow us down. Let's just run these checks in CI." Write a 3-4 sentence argument for why pre-commit hooks and CI checks are complementary, not alternatives.
d) A data scientist on your team commits a notebook with 200 MB of embedded images in the output cells. The check-added-large-files hook was configured with --maxkb=500 (500 KB per file). Explain why this hook did not catch the notebook, and propose a solution.
Exercise 5: Technical Debt Inventory (Conceptual)
You join a team maintaining an ML system with the following characteristics:
- Training pipeline is a 2,800-line Jupyter notebook
- Feature engineering logic is copy-pasted between training and serving (with 3 known discrepancies)
- Hyperparameters are hardcoded in 4 different files
- The test suite has 2 tests, both of which are
assert True - Model artifacts are stored on a team member's personal S3 bucket
- Data schema changes from the upstream team have broken the pipeline 3 times in the past quarter
- The model was last retrained 7 months ago
- Nobody on the current team wrote the original code
a) Categorize each item as code-level debt, data-level debt, or configuration debt. Some items may fit multiple categories.
b) Rank the items by severity. For each ranking, explain the risk: what is the worst thing that could happen if this debt is not addressed?
c) Propose a 6-week remediation plan. For each week, specify which debt item(s) you would address and what the deliverable is. Consider dependencies between items.
d) The product manager says: "We need a new feature by next sprint. Technical debt can wait." Write a response (4-6 sentences) that acknowledges the business urgency while making the case for debt reduction. Use specific examples from the inventory.
e) After your remediation, define 3 metrics you would track to prevent debt from reaccumulating. For each metric, specify the threshold that would trigger action.
Exercise 6: Mypy in Practice (Code)
Add type hints to the following functions and resolve all mypy errors. Run mypy --strict on your result.
```python
# src/evaluation/metrics.py
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve


def compute_auc(y_true, y_prob):
    if len(set(y_true)) < 2:
        return None
    return roc_auc_score(y_true, y_prob)


def find_optimal_threshold(y_true, y_prob, metric='f1'):
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    if metric == 'f1':
        f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
        best_idx = np.argmax(f1_scores)
    elif metric == 'precision_at_90_recall':
        valid = recall >= 0.90
        if not valid.any():
            return thresholds[0]
        # Map the argmax over the masked subarray back to an index into the
        # original precision/thresholds arrays.
        best_idx = np.flatnonzero(valid)[np.argmax(precision[valid])]
    else:
        raise ValueError(f"Unknown metric: {metric}")
    return thresholds[best_idx]


def compute_business_metrics(y_true, y_pred, revenue_per_customer, save_rate, cost_per_intervention):
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    saved_revenue = tp * revenue_per_customer * save_rate
    intervention_cost = (tp + fp) * cost_per_intervention
    missed_revenue = fn * revenue_per_customer
    net_value = saved_revenue - intervention_cost
    return {
        'saved_revenue': saved_revenue,
        'intervention_cost': intervention_cost,
        'missed_revenue': missed_revenue,
        'net_value': net_value,
        'roi': net_value / intervention_cost if intervention_cost > 0 else float('inf'),
    }
```
a) Add complete type annotations to all three functions, including the return types and the types of all parameters.
b) The compute_auc function returns None when there is only one class. What type hint captures "returns a float or None"? Write a caller function that handles the None case without a mypy error.
c) The find_optimal_threshold function has a branch where best_idx might be undefined if metric matches neither branch (e.g., due to a typo in a future refactoring). How does mypy detect this? What is the simplest fix?
d) The compute_business_metrics return dictionary mixes float and int values. What is the appropriate return type hint? Would TypedDict be a better choice? Implement it and explain the tradeoff.
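As a starting point for parts (a) and (b), here is a sketch of the annotated `compute_auc` plus a caller that narrows away the `None`. The caller name `format_auc` and the array annotations are illustrative choices, not the only valid ones:

```python
from typing import Optional

import numpy as np
import numpy.typing as npt
from sklearn.metrics import roc_auc_score


def compute_auc(
    y_true: npt.NDArray[np.int_],
    y_prob: npt.NDArray[np.float64],
) -> Optional[float]:
    """AUC, or None when y_true contains a single class (AUC is undefined)."""
    if len(set(y_true)) < 2:
        return None
    return float(roc_auc_score(y_true, y_prob))


def format_auc(
    y_true: npt.NDArray[np.int_],
    y_prob: npt.NDArray[np.float64],
) -> str:
    auc = compute_auc(y_true, y_prob)
    if auc is None:
        # mypy narrows auc to float after this early return
        return "AUC undefined: only one class in y_true"
    return f"AUC = {auc:.3f}"
```

The explicit `is None` check is what satisfies mypy: without it, formatting `auc` as a float would be an `Optional[float]` error under `--strict`.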
Exercise 7: End-to-End Refactoring Challenge (Project)
Take the StreamFlow churn model you built in Chapters 1–19 (or use the provided notebook) and refactor it into a production-quality Python package. Your deliverable must include:
a) A cookiecutter-style project structure with src/data/, src/features/, src/models/, src/evaluation/, and config/.
b) At least 15 unit tests and 2 integration tests, all passing.
c) Pre-commit hooks configured for black, ruff, and nbstripout.
d) Type hints on all public functions in src/.
e) A Makefile with targets for data, features, train, evaluate, test, lint, and format.
f) A TECH_DEBT.md file documenting at least 3 known shortcuts and their remediation plan.
g) A notebook in notebooks/ that imports from src/ and reproduces the EDA and model evaluation visualizations without containing any feature engineering or training logic.
Evaluation criteria:
- `make clean && make all` reproduces the model from raw data
- `make test` passes all tests with zero failures
- `make lint` returns zero errors
- The model's AUC is within 0.01 of the original notebook's AUC
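A minimal sketch of the Makefile that part (e) asks for. The module entry points (`src.data.make_dataset` and friends) are hypothetical and must match your actual package layout; recall that Makefile recipes require tab indentation:

```makefile
.PHONY: all data features train evaluate test lint format clean

all: data features train evaluate

data:
	python -m src.data.make_dataset

features:
	python -m src.features.build_features

train:
	python -m src.models.train_model

evaluate:
	python -m src.evaluation.evaluate_model

test:
	pytest tests/

lint:
	ruff check src/ tests/
	mypy src/

format:
	black src/ tests/

clean:
	rm -rf data/processed models/*.pkl reports/figures/*
```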
Exercises correspond to Chapter 29: Software Engineering for Data Scientists. See key-takeaways.md for the core principles and further-reading.md for additional resources.