Exercises: Chapter 29

Software Engineering for Data Scientists


Exercise 1: Project Structure Audit (Conceptual + Code)

You inherit a data science project with the following directory structure:

churn_project/
    data.csv
    churn_model_v1.ipynb
    churn_model_v2.ipynb
    churn_model_FINAL.ipynb
    churn_model_FINAL_fixed.ipynb
    model.pkl
    utils.py
    requirements.txt
    output.csv
    figures/
        roc_curve.png
        feature_importance.png
    old_stuff/
        churn_model_old.ipynb
        data_backup.csv

a) List every structural problem with this project. Consider reproducibility, collaboration, version control, data management, and maintainability.

b) Propose a refactored directory structure following cookiecutter-data-science conventions. For each file or directory in the original, state where it should go in the new structure, and explain why.

c) Write a .gitignore file for the refactored project. For each entry, add a comment explaining what it excludes and why.
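As a starting point for part (c), a minimal .gitignore might begin like this (the paths assume the refactored layout you propose in part (b)):

```gitignore
# Raw and processed data: too large for git and regenerated by the pipeline;
# track provenance with a data registry or DVC instead
data/

# Trained model binaries are build artifacts, not source
*.pkl

# Per-machine virtual environments
.venv/

# Jupyter checkpoint files are editor state, not code
.ipynb_checkpoints/

# Python bytecode caches
__pycache__/

# Secrets and local configuration must never reach the repository
.env
```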

d) The utils.py file contains 14 functions: 5 for data cleaning, 4 for feature engineering, 3 for visualization, and 2 for model evaluation. How would you split this into multiple modules? Write the __init__.py for each new module, exposing an appropriate public API.

e) A teammate argues: "This restructuring is going to take two days and it doesn't improve the model at all. We should be tuning hyperparameters instead." Write a response (3-5 sentences) explaining why the restructuring is worth the investment.


Exercise 2: Writing Your First Tests (Code)

Below is a feature engineering function extracted from a notebook. Write a complete test file for it.

# src/features/engagement.py

import pandas as pd
import numpy as np


def compute_engagement_features(
    events: pd.DataFrame,
    reference_date: pd.Timestamp,
    window_days: int = 30,
) -> pd.DataFrame:
    """Compute engagement features per user over a rolling window.

    Args:
        events: DataFrame with columns ['user_id', 'timestamp', 'event_type'].
        reference_date: End of the observation window.
        window_days: Number of days to look back from reference_date.

    Returns:
        DataFrame with columns:
        - user_id
        - event_count: total events in window
        - active_days: distinct days with at least one event
        - engagement_rate: active_days / window_days
        - favorite_event_type: most frequent event type
    """
    window_start = reference_date - pd.Timedelta(days=window_days)
    windowed = events[
        (events["timestamp"] >= window_start)
        & (events["timestamp"] < reference_date)
    ].copy()

    if windowed.empty:
        return pd.DataFrame(columns=[
            "user_id", "event_count", "active_days",
            "engagement_rate", "favorite_event_type",
        ])

    windowed["date"] = windowed["timestamp"].dt.date

    event_count = windowed.groupby("user_id").size().reset_index(name="event_count")
    active_days = (
        windowed.groupby("user_id")["date"]
        .nunique()
        .reset_index(name="active_days")
    )
    favorite = (
        windowed.groupby("user_id")["event_type"]
        .agg(lambda x: x.value_counts().index[0])
        .reset_index(name="favorite_event_type")
    )

    result = event_count.merge(active_days, on="user_id")
    result = result.merge(favorite, on="user_id")
    result["engagement_rate"] = result["active_days"] / window_days
    return result

a) Write at least 6 unit tests using pytest. Cover:

- Normal case with multiple users and event types
- Empty events DataFrame
- Single user with a single event
- Events entirely outside the window
- User with events on every day of the window
- Custom window size

b) Write a pytest fixture in conftest.py that generates a reusable event DataFrame with at least 4 users, 3 event types, and events spanning 60 days.
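A sketch of such a fixture, assuming pandas and numpy; the user names, event types, and seed are arbitrary choices:

```python
# tests/conftest.py (sketch)
import numpy as np
import pandas as pd
import pytest


def make_events(n: int = 200, seed: int = 42) -> pd.DataFrame:
    """Build a synthetic event log: 4 users, 3 event types, ~60 days of history."""
    rng = np.random.default_rng(seed)  # fixed seed keeps tests deterministic
    return pd.DataFrame({
        "user_id": rng.choice(["u1", "u2", "u3", "u4"], size=n),
        "timestamp": pd.Timestamp("2024-03-01")
        - pd.to_timedelta(rng.integers(0, 60, size=n), unit="D"),
        "event_type": rng.choice(["play", "pause", "search"], size=n),
    })


@pytest.fixture
def events_df() -> pd.DataFrame:
    return make_events()
```

Keeping the builder (make_events) separate from the fixture lets tests that need a tweaked log call it directly with different arguments.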

c) Use @pytest.mark.parametrize to test the engagement rate calculation with at least 5 different (active_days, window_days, expected_rate) combinations.
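Part (c) can start as a scaffold over the bare formula, since engagement_rate is defined as active_days / window_days; point it at compute_engagement_features once your fixture produces matching data:

```python
import pytest


@pytest.mark.parametrize(
    "active_days, window_days, expected_rate",
    [
        (0, 30, 0.0),
        (15, 30, 0.5),
        (30, 30, 1.0),
        (7, 7, 1.0),
        (3, 60, 0.05),
    ],
)
def test_engagement_rate_formula(active_days, window_days, expected_rate):
    # Mirrors: result["engagement_rate"] = result["active_days"] / window_days
    assert active_days / window_days == pytest.approx(expected_rate)
```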

d) Write one integration test that calls compute_engagement_features, merges the result with a subscriber table, and verifies that the merged DataFrame has the expected shape and no unexpected nulls.


Exercise 3: The Refactoring Kata (Code)

The following notebook code computes monetary features for the StreamFlow churn model. Refactor it into a proper function in src/features/monetary.py.

# --- Notebook Cell 23 ---
# Compute monetary features
# (relies on `events_df` and `np` being defined by earlier cells --
#  exactly the kind of hidden dependency the refactoring should remove)
rev_df = events_df[events_df['event_type'] == 'billing_event'].copy()
rev_df['month'] = rev_df['timestamp'].dt.to_period('M')
monthly_rev = rev_df.groupby(['user_id', 'month'])['revenue'].sum().reset_index()
avg_rev = monthly_rev.groupby('user_id')['revenue'].mean().reset_index()
avg_rev.columns = ['user_id', 'avg_monthly_revenue']
max_rev = monthly_rev.groupby('user_id')['revenue'].max().reset_index()
max_rev.columns = ['user_id', 'max_monthly_revenue']
rev_trend = monthly_rev.sort_values('month').groupby('user_id')['revenue'].apply(
    lambda x: np.polyfit(range(len(x)), x, 1)[0] if len(x) > 1 else 0
).reset_index()
rev_trend.columns = ['user_id', 'revenue_trend']
monetary = avg_rev.merge(max_rev, on='user_id').merge(rev_trend, on='user_id')

Your refactored function must:

a) Have a clear function signature with type hints and a docstring.

b) Accept events and reference_date as parameters (not rely on global variables).

c) Handle edge cases: users with no billing events, users with exactly one month of data, empty DataFrames.

d) Return a DataFrame with columns ['user_id', 'avg_monthly_revenue', 'max_monthly_revenue', 'revenue_trend'].

e) Include a test that verifies the revenue trend is positive for a user whose monthly revenue increases over time, and negative for a user whose revenue decreases.

f) Replace the lambda inside .apply() with a named function. Explain why named functions are preferred over lambdas in production code.
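As a starting point for part (f), the lambda can become a named helper; the name revenue_slope is a suggestion, not part of the chapter's API:

```python
import numpy as np
import pandas as pd


def revenue_slope(monthly_revenue: pd.Series) -> float:
    """Least-squares slope of revenue across consecutive months.

    Returns 0.0 for users with fewer than two months of data,
    matching the notebook's behavior.
    """
    if len(monthly_revenue) < 2:
        return 0.0
    return float(np.polyfit(range(len(monthly_revenue)), monthly_revenue, 1)[0])
```

Unlike the lambda, this version has a docstring, a testable name, appears in tracebacks and profiler output, and can be reused by both training and serving code.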


Exercise 4: Pre-Commit Configuration (Code + Conceptual)

a) Create a .pre-commit-config.yaml file that includes:

- black (formatting)
- ruff (linting with auto-fix)
- mypy (type checking)
- nbstripout (notebook output stripping)
- A check for files larger than 1 MB
- A check for accidental commit of .env files
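A skeleton to adapt (the rev pins below are placeholders; pin whatever versions your team standardizes on):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff
        args: [--fix]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0
    hooks:
      - id: mypy
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-added-large-files
        args: ["--maxkb=1024"]  # 1 MB limit
  - repo: local
    hooks:
      - id: forbid-dotenv
        name: forbid committing .env files
        entry: ".env files must not be committed"
        language: fail  # pre-commit's built-in "always fail on matching files"
        files: "\\.env$"
```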

b) A teammate runs git commit and the pre-commit hooks reject the commit with the following output:

black....................................................................Failed
- hook id: black
- files were modified by this hook

ruff.....................................................................Failed
- hook id: ruff
- exit code: 1

src/features/build_features.py:47:5: F841 Local variable 'temp_df' is assigned
    to but never used

mypy.....................................................................Failed
- hook id: mypy

src/models/train_model.py:23: error: Argument 1 to "fit" has incompatible type
    "DataFrame"; expected "ndarray[Any, dtype[floating[Any]]]"

For each failure, explain: (1) what the tool detected, (2) how to fix it, and (3) what would have happened if this code had been committed without the hook.

c) Your team lead says: "Pre-commit hooks slow us down. Let's just run these checks in CI." Write a 3-4 sentence argument for why pre-commit hooks and CI checks are complementary, not alternatives.

d) A data scientist on your team commits a notebook with 200 MB of embedded images in the output cells. The check-added-large-files hook was configured with --maxkb=500 (500 KB per file). Explain why this hook did not catch the notebook, and propose a solution.


Exercise 5: Technical Debt Inventory (Conceptual)

You join a team maintaining an ML system with the following characteristics:

  • Training pipeline is a 2,800-line Jupyter notebook
  • Feature engineering logic is copy-pasted between training and serving (with 3 known discrepancies)
  • Hyperparameters are hardcoded in 4 different files
  • The test suite has 2 tests, both of which are assert True
  • Model artifacts are stored on a team member's personal S3 bucket
  • Data schema changes from the upstream team have broken the pipeline 3 times in the past quarter
  • The model was last retrained 7 months ago
  • Nobody on the current team wrote the original code

a) Categorize each item as code-level debt, data-level debt, or configuration debt. Some items may fit multiple categories.

b) Rank the items by severity. For each ranking, explain the risk: what is the worst thing that could happen if this debt is not addressed?

c) Propose a 6-week remediation plan. For each week, specify which debt item(s) you would address and what the deliverable is. Consider dependencies between items.

d) The product manager says: "We need a new feature by next sprint. Technical debt can wait." Write a response (4-6 sentences) that acknowledges the business urgency while making the case for debt reduction. Use specific examples from the inventory.

e) After your remediation, define 3 metrics you would track to prevent debt from reaccumulating. For each metric, specify the threshold that would trigger action.


Exercise 6: Mypy in Practice (Code)

Add type hints to the following functions and resolve all mypy errors. Run mypy --strict on your result.

# src/evaluation/metrics.py

import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve


def compute_auc(y_true, y_prob):
    if len(set(y_true)) < 2:
        return None
    return roc_auc_score(y_true, y_prob)


def find_optimal_threshold(y_true, y_prob, metric='f1'):
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    if metric == 'f1':
        f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
        best_idx = np.argmax(f1_scores)
    elif metric == 'precision_at_90_recall':
        valid = recall >= 0.90
        if not valid.any():
            return thresholds[0]
        # argmax over the masked slice returns a position within that slice,
        # so map it back to an index into the full arrays before using it
        best_idx = np.flatnonzero(valid)[np.argmax(precision[valid])]
    else:
        raise ValueError(f"Unknown metric: {metric}")
    return thresholds[best_idx]


def compute_business_metrics(y_true, y_pred, revenue_per_customer, save_rate, cost_per_intervention):
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    saved_revenue = tp * revenue_per_customer * save_rate
    intervention_cost = (tp + fp) * cost_per_intervention
    missed_revenue = fn * revenue_per_customer
    net_value = saved_revenue - intervention_cost
    return {
        'saved_revenue': saved_revenue,
        'intervention_cost': intervention_cost,
        'missed_revenue': missed_revenue,
        'net_value': net_value,
        'roi': net_value / intervention_cost if intervention_cost > 0 else float('inf'),
    }

a) Add complete type annotations to all three functions, including the return types and the types of all parameters.

b) The compute_auc function returns None when there is only one class. What type hint captures "returns a float or None"? Write a caller function that handles the None case without a mypy error.

c) As written, the else branch of find_optimal_threshold raises ValueError, so best_idx is always assigned before the final return. Suppose a future refactoring drops that else clause: a metric value that matches neither branch would then reach the return with best_idx never assigned. How does mypy detect this possibly-undefined variable? What is the simplest fix?

d) The compute_business_metrics return dictionary mixes float and int values. What is the appropriate return type hint? Would TypedDict be a better choice? Implement it and explain the tradeoff.
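A sketch of the TypedDict variant for part (d); the simplified signature (taking tp/fp/fn counts directly) is for illustration only:

```python
from typing import TypedDict


class BusinessMetrics(TypedDict):
    saved_revenue: float
    intervention_cost: float
    missed_revenue: float
    net_value: float
    roi: float


def business_metrics(
    tp: int,
    fp: int,
    fn: int,
    revenue_per_customer: float,
    save_rate: float,
    cost_per_intervention: float,
) -> BusinessMetrics:
    # same arithmetic as compute_business_metrics, with a typed result
    saved = tp * revenue_per_customer * save_rate
    cost = (tp + fp) * cost_per_intervention
    net = saved - cost
    return BusinessMetrics(
        saved_revenue=saved,
        intervention_cost=cost,
        missed_revenue=fn * revenue_per_customer,
        net_value=net,
        roi=net / cost if cost > 0 else float("inf"),
    )
```

The tradeoff: dict[str, float] is quicker to write but lets key typos slip through silently, while TypedDict catches a misspelled key at type-check time at the cost of declaring the schema up front.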


Exercise 7: End-to-End Refactoring Challenge (Project)

Take the StreamFlow churn model you built in Chapters 1-19 (or use the provided notebook) and refactor it into a production-quality Python package. Your deliverable must include:

a) A cookiecutter-style project structure with src/data/, src/features/, src/models/, src/evaluation/, and config/.

b) At least 15 unit tests and 2 integration tests, all passing.

c) Pre-commit hooks configured for black, ruff, and nbstripout.

d) Type hints on all public functions in src/.

e) A Makefile with targets for data, features, train, evaluate, test, lint, and format, plus the all and clean targets that the evaluation criteria invoke.
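A skeletal Makefile for part (e); the module paths and the clean target's paths are assumptions about your package layout, not prescribed names:

```make
.PHONY: all clean data features train evaluate test lint format

all: data features train evaluate

data:
	python -m src.data.make_dataset

features:
	python -m src.features.build_features

train:
	python -m src.models.train_model

evaluate:
	python -m src.evaluation.evaluate

test:
	pytest tests/

lint:
	ruff check src/ tests/
	mypy src/

format:
	black src/ tests/

clean:
	rm -rf data/processed models/artifacts
```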

f) A TECH_DEBT.md file documenting at least 3 known shortcuts and their remediation plan.

g) A notebook in notebooks/ that imports from src/ and reproduces the EDA and model evaluation visualizations without containing any feature engineering or training logic.

Evaluation criteria:

- make clean && make all reproduces the model from raw data
- make test passes all tests with zero failures
- make lint returns zero errors
- The model's AUC is within 0.01 of the original notebook's AUC


Exercises correspond to Chapter 29: Software Engineering for Data Scientists. See key-takeaways.md for the core principles and further-reading.md for additional resources.