Chapter 36: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
System Architecture
Exercise 36.1 (*)
Using the CapstoneArchitecture class from Section 36.2, add a new component called content_moderation that runs between the re-ranker and the response assembly. This component filters items that violate content policies; it adds 10ms of latency and is needed in all three tracks.
(a) Register the component with appropriate chapter origins (you may assign it to Chapter 24 for system design and Chapter 35 for interpretability — since content moderation decisions should be explainable).
(b) Re-run the latency budget check for all three tracks. Does Track C still fit within the 200ms budget? If not, which component's budget would you reduce and why?
(c) Write an ADR for the decision to add content moderation as a synchronous step rather than an asynchronous pre-filter.
Exercise 36.2 (*)
The CapstoneArchitecture.latency_budget_check() method sums all component latencies. This is incorrect for components that run in parallel (e.g., the two-tower and LightGCN retrievers in Track C).
(a) Extend the CapstoneArchitecture class to support parallel component groups. Components within a group share the maximum latency of the group rather than the sum.
```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ParallelGroup:
    """A group of components that run in parallel.

    The group's latency contribution is max(component latencies),
    not sum(component latencies).

    Attributes:
        name: Group identifier.
        component_names: Set of component names in this group.
    """

    name: str
    component_names: Set[str] = field(default_factory=set)


@dataclass
class CapstoneArchitectureV2(CapstoneArchitecture):
    """Extended architecture with parallel group support.

    Attributes:
        parallel_groups: List of parallel execution groups.
    """

    parallel_groups: List[ParallelGroup] = field(default_factory=list)

    def critical_path_latency(self) -> float:
        """Compute critical path latency accounting for parallelism.

        Components in a parallel group contribute max(latencies)
        instead of sum(latencies). Components not in any group
        contribute their individual latency.

        Returns:
            Critical path latency in milliseconds.
        """
        active = self.active_components()
        grouped_components = set()
        total = 0.0
        for group in self.parallel_groups:
            group_components = [
                active[name]
                for name in group.component_names
                if name in active
            ]
            if group_components:
                total += max(c.latency_budget_ms for c in group_components)
                grouped_components.update(
                    name for name in group.component_names if name in active
                )
        for name, comp in active.items():
            if name not in grouped_components:
                total += comp.latency_budget_ms
        return total
```
(b) Add a parallel group for the two retrieval sources in Track C. What is the corrected critical path latency?
(c) Identify another pair of components in the StreamRec architecture that could plausibly run in parallel. Justify your answer.
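The max-vs-sum logic from part (a) can be checked without the full CapstoneArchitecture class. The sketch below uses stub components with illustrative latency budgets (the 30/40/50ms figures are placeholders, not the book's actual Track C numbers, which you need for part (b)):

```python
from dataclasses import dataclass

# Minimal stand-in for the book's component class, just enough to
# exercise the parallel-group critical-path computation.
@dataclass
class StubComponent:
    name: str
    latency_budget_ms: float

components = {
    "two_tower_retrieval": StubComponent("two_tower_retrieval", 30.0),
    "lightgcn_retrieval": StubComponent("lightgcn_retrieval", 40.0),
    "ranker": StubComponent("ranker", 50.0),
}
parallel_groups = [{"two_tower_retrieval", "lightgcn_retrieval"}]

def critical_path_ms(components, parallel_groups):
    """Sum sequential latencies; each parallel group contributes its max."""
    grouped = set().union(*parallel_groups) if parallel_groups else set()
    total = sum(
        max(components[n].latency_budget_ms for n in g if n in components)
        for g in parallel_groups
    )
    total += sum(
        c.latency_budget_ms for n, c in components.items() if n not in grouped
    )
    return total

# Naive sum: 30 + 40 + 50 = 120ms; critical path: max(30, 40) + 50 = 90ms.
print(critical_path_ms(components, parallel_groups))  # -> 90.0
```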
Exercise 36.3 (**)
Design the complete StreamRec architecture for a cold-start scenario: a brand-new user with no interaction history visits the platform for the first time.
(a) Which components in the standard architecture cannot function (or function poorly) for a cold-start user? List each component and explain why.
(b) Design a fallback path that serves acceptable recommendations to cold-start users. Specify the retrieval strategy (content-based? trending? region-based?), the ranking approach (can you use the same ranker?), and the feature store behavior (what features are available?).
(c) Write an ADR for your cold-start strategy, including the alternative of waiting for N interactions before enabling personalized recommendations.
(d) How does the Bayesian cold-start handling from Chapter 20 (Beta-Binomial user preferences) integrate with your cold-start architecture? Where in the pipeline does it operate?
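For part (d), the Chapter 20 machinery reduces to a conjugate update. A minimal sketch, assuming an illustrative population prior of Beta(2, 18) (roughly a 10% baseline click rate; in practice the prior would be fit to your data):

```python
# Beta-Binomial update for a cold-start user's click propensity.
# The Beta(2, 18) prior is an assumed, illustrative value.
def update_preference(alpha0, beta0, clicks, skips):
    """Return posterior (alpha, beta) and the posterior-mean click rate."""
    alpha, beta = alpha0 + clicks, beta0 + skips
    return alpha, beta, alpha / (alpha + beta)

print(update_preference(2.0, 18.0, 0, 0)[2])   # brand-new user: prior mean 0.1
print(update_preference(2.0, 18.0, 3, 7)[2])   # after 3 clicks / 10 impressions: 5/30
```

The prior dominates for a brand-new user, which is exactly the fallback-path regime; a handful of interactions is enough to start pulling the estimate toward the user's own behavior.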
Exercise 36.4 (**)
The StreamRec system uses a 200ms end-to-end latency budget. A product manager proposes adding a language translation component that translates non-English item titles and descriptions into the user's preferred language, enabling cross-language recommendations. The translation service has a p50 latency of 25ms and p99 latency of 80ms.
(a) Analyze the latency impact of adding this component. Can it fit within the 200ms budget? Under what conditions?
(b) Propose three alternative integration strategies: (i) synchronous in the serving path, (ii) asynchronous pre-computation, (iii) a hybrid approach. For each, analyze the latency impact, freshness tradeoff, and infrastructure cost.
(c) Write an ADR for your recommended approach.
(d) How does cross-language recommendation interact with the creator fairness analysis from Chapter 31? Would enabling cross-language retrieval improve or worsen the exposure equity ratio for non-English creators?
Architecture Decision Records
Exercise 36.5 (*)
Write a complete ADR for the following decision: choosing between the MLP ranker (Chapter 6) and the transformer ranker (Chapter 10) for the Track B system. Use the ADR class from Section 36.4.
Your ADR must include:

- Context: latency budget, feature availability, expected quality improvement
- At least two alternatives (MLP, transformer, and optionally an ensemble)
- Consequences: both positive (quality improvement) and negative (latency increase, maintenance complexity)
Exercise 36.6 (**)
Write an ADR for the build-vs-buy decision for the feature store. Compare:

- Build: Custom feature store using Redis (online) + Parquet/Delta Lake (offline), as implemented in Chapter 25
- Buy: Feast (open-source managed feature store)
- Buy: Tecton (commercial managed feature store)
For each alternative, estimate the Year 1 and Year 3 total cost (engineering time for build, license fees for buy). Include the risk of vendor lock-in for commercial solutions and the maintenance burden for the custom solution.
Exercise 36.7 (**)
You are six months into production with the Track B system. The following events have occurred:

1. The FAISS index corruption incident from Chapter 30 required 4 hours to diagnose and fix.
2. A new content format (short-form video) was launched, and the two-tower model's content encoder does not support it.
3. A regulatory inquiry from the FTC requires documentation of how recommendations are generated for users under 18.
Write ADRs for the three decisions you need to make in response:

- ADR for FAISS index resilience (rebuild automation, backup indices, or a switch to a managed vector DB)
- ADR for multi-modal content encoding (extend the two-tower model, add a separate encoder, or adopt a foundation model)
- ADR for an age-specific recommendation policy (filter-based, model-based, or a separate model)
Evaluation and Metrics
Exercise 36.8 (*)
Using the CapstoneEvaluation class from Section 36.5, build an evaluation for a Track A system with the following metrics:
| Metric | Value |
|---|---|
| Retrieval Recall@500 | 0.55 |
| Ranking Hit@10 | 0.16 |
| Ranking NDCG@10 | 0.09 |
| Ranking AUC | 0.73 |
| p99 latency | 165ms |
| Feature store hit rate | 0.987 |
| Creator exposure equity ratio | 0.54 |
(a) Set appropriate thresholds for Track A and determine which metrics pass.
(b) The creator exposure equity ratio (0.54) is below the 0.60 threshold from Section 36.5. Identify three possible root causes using concepts from Chapter 31.
(c) Propose one intervention from each category (pre-processing, in-processing, post-processing) that could improve the equity ratio while minimizing Hit@10 degradation.
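A plain-dict sketch of the part (a) threshold check. The CapstoneEvaluation API from Section 36.5 is not reproduced here, and every threshold below except the 0.60 equity floor is a placeholder you should replace with your own Track A choices:

```python
# Track A metrics from the table above; thresholds are illustrative
# except exposure_equity, whose 0.60 floor comes from Section 36.5.
metrics = {
    "recall_at_500": 0.55, "hit_at_10": 0.16, "ndcg_at_10": 0.09,
    "auc": 0.73, "p99_latency_ms": 165.0, "feature_hit_rate": 0.987,
    "exposure_equity": 0.54,
}
# (metric, threshold, higher_is_better)
thresholds = [
    ("recall_at_500", 0.50, True), ("hit_at_10", 0.15, True),
    ("ndcg_at_10", 0.08, True), ("auc", 0.70, True),
    ("p99_latency_ms", 200.0, False), ("feature_hit_rate", 0.98, True),
    ("exposure_equity", 0.60, True),
]
for name, thresh, higher in thresholds:
    ok = metrics[name] >= thresh if higher else metrics[name] <= thresh
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
# Under these placeholder thresholds, only exposure_equity fails (0.54 < 0.60).
```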
Exercise 36.9 (**)
The naive CTR improvement of the Track B system is 8.7% over the random baseline. The causal ATE estimate (doubly robust) is 4.1%. Explain the gap.
(a) Identify at least three confounders that could inflate the naive estimate. For each confounder, explain the mechanism by which it creates upward bias.
(b) Draw a causal DAG (using the DAG conventions from Chapter 17) showing the relationship between the recommendation, user engagement, and the confounders you identified.
(c) Compute the ratio of naive-to-causal effect (8.7 / 4.1 = 2.12). What does this ratio tell you about the fraction of observed engagement that is caused by the recommendation vs. the fraction that would have occurred anyway?
(d) Which estimate should you report to the executive team? Justify your answer using the stakeholder presentation framework from Section 36.6.
Exercise 36.10 (***)
Design a multi-objective evaluation framework that trades off Hit@10, creator exposure equity, and user diversity (fraction of distinct items across all users' recommendation lists). No single metric can capture all three objectives simultaneously.
(a) Define a Pareto frontier for these three objectives. Explain what it means for an operating point to be Pareto-dominated.
(b) Implement a MultiObjectiveEvaluator class that takes a list of recommendations and computes all three metrics. Include a method that determines whether one system Pareto-dominates another.
```python
from dataclasses import dataclass
from typing import Dict, List, Set

import numpy as np


@dataclass
class MultiObjectiveEvaluator:
    """Multi-objective evaluation for recommendation systems.

    Computes Hit@10, creator exposure equity, and catalog diversity
    simultaneously, and supports Pareto dominance comparison.

    Attributes:
        metrics: Dictionary of metric name to value.
    """

    metrics: Dict[str, float]

    @staticmethod
    def compute(
        recommendations: List[List[int]],
        ground_truth: List[Set[int]],
        item_to_creator_group: Dict[int, str],
        creator_group_content_share: Dict[str, float],
    ) -> "MultiObjectiveEvaluator":
        """Compute all three objectives from recommendation lists.

        Args:
            recommendations: List of recommendation lists (one per user).
            ground_truth: List of ground truth item sets (one per user).
            item_to_creator_group: Mapping from item ID to creator group.
            creator_group_content_share: Content share per creator group.

        Returns:
            MultiObjectiveEvaluator with computed metrics.
        """
        # Hit@10
        hits = 0
        for recs, truth in zip(recommendations, ground_truth):
            if any(r in truth for r in recs[:10]):
                hits += 1
        hit_at_10 = hits / len(recommendations) if recommendations else 0.0

        # Creator exposure equity
        impression_counts: Dict[str, int] = {}
        total_impressions = 0
        for recs in recommendations:
            for item in recs[:10]:
                group = item_to_creator_group.get(item, "unknown")
                impression_counts[group] = impression_counts.get(group, 0) + 1
                total_impressions += 1
        equity_ratios = []
        for group, content_share in creator_group_content_share.items():
            impression_share = (
                impression_counts.get(group, 0) / total_impressions
                if total_impressions > 0 else 0.0
            )
            if content_share > 0:
                equity_ratios.append(impression_share / content_share)
        exposure_equity = min(equity_ratios) if equity_ratios else 0.0

        # Catalog diversity
        all_recommended = set()
        for recs in recommendations:
            all_recommended.update(recs[:10])
        total_items = len(item_to_creator_group)
        diversity = len(all_recommended) / total_items if total_items > 0 else 0.0

        return MultiObjectiveEvaluator(metrics={
            "hit_at_10": hit_at_10,
            "exposure_equity": exposure_equity,
            "catalog_diversity": diversity,
        })

    def dominates(self, other: "MultiObjectiveEvaluator") -> bool:
        """Check if this evaluator Pareto-dominates the other.

        Dominance requires being >= on all metrics and > on at least one.

        Args:
            other: The evaluator to compare against.

        Returns:
            True if self Pareto-dominates other.
        """
        at_least_as_good = all(
            self.metrics[k] >= other.metrics[k]
            for k in self.metrics
        )
        strictly_better = any(
            self.metrics[k] > other.metrics[k]
            for k in self.metrics
        )
        return at_least_as_good and strictly_better
```
(c) Consider three system configurations: (i) maximize Hit@10 with no fairness constraint, (ii) apply a post-processing exposure equity constraint, (iii) add diversity re-ranking. Under what conditions does none of these configurations Pareto-dominate the others?
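The mutual non-dominance situation in part (c) is easy to see with toy numbers. The metric triples below are illustrative, not measured results:

```python
# Hypothetical operating points for the three configurations in (c).
points = {
    "accuracy_only":     {"hit_at_10": 0.18, "exposure_equity": 0.45, "catalog_diversity": 0.20},
    "equity_constraint": {"hit_at_10": 0.16, "exposure_equity": 0.70, "catalog_diversity": 0.22},
    "diversity_rerank":  {"hit_at_10": 0.17, "exposure_equity": 0.55, "catalog_diversity": 0.35},
}

def dominates(a, b):
    """Pareto dominance: >= on every metric and > on at least one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

for name_a, a in points.items():
    for name_b, b in points.items():
        if name_a != name_b and dominates(a, b):
            print(f"{name_a} dominates {name_b}")
# Prints nothing: each configuration sacrifices one objective to win on
# another, so all three sit on the Pareto frontier.
```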
Stakeholder Communication
Exercise 36.11 (*)
Using the StakeholderPresentation class from Section 36.6, generate talking points for the legal/compliance audience. Your talking points should address:
- Creator exposure equity and the risk of perceived or actual discrimination
- Privacy practices (what user data is collected, how it is used, whether DP is applied)
- Transparency (can users understand why they see a specific recommendation?)
Exercise 36.12 (**)
A product manager asks: "Why did we spend 3 months building a recommendation system when we could have just shown users the most popular content?" Write a one-page response that:
(a) Compares the recommendation system's performance to the popularity baseline across all three evaluation levels (model, system, business).
(b) Quantifies the revenue impact difference between personalized recommendations (CTR +3.4%) and popularity ranking (CTR +0.8% over random).
(c) Addresses the fairness dimension: does popularity-based ranking have better or worse creator exposure equity than the personalized system? Explain the mechanism.
(d) Concedes the popularity baseline's advantages: simpler, cheaper, more transparent, easier to explain. Explain when and why a team should start with popularity and upgrade to personalized recommendations.
Exercise 36.13 (**)
Prepare two versions of a 5-minute system demo:

1. Version A: Technical peers. Demonstrate the system's architecture, the ADR trail, the evaluation results, and the causal impact estimate. Assume the audience understands NDCG, ATE, and PSI.
2. Version B: Executive stakeholders. Demonstrate the user-facing experience, the business impact, and the risk mitigation. Assume the audience understands revenue, retention, and regulatory risk but not NDCG or ATE.
For each version, write the script (talking points for each slide/demo screen) and identify the three most important numbers to highlight.
Technical Roadmap and Debt
Exercise 36.14 (*)
Add three more items to the TechnicalRoadmap from Section 36.7. Your items should address:
1. A model improvement not currently on the roadmap (e.g., adding session context to the two-tower model, multi-task learning for engagement prediction)
2. A monitoring improvement (e.g., adding conformal prediction-based anomaly detection to the monitoring dashboard)
3. A cost optimization (e.g., model distillation to reduce serving cost, spot instance training)
For each item, specify the expected impact, effort estimate, priority, and dependencies.
Exercise 36.15 (**)
The unscheduled debt item TD-005 (no continuous training-serving skew monitoring) is marked as high severity. Design the remediation.
(a) Using the PSI-based drift detection framework from Chapter 30, specify which features to monitor for skew, what the PSI thresholds should be, and how often to check.
(b) Design the alerting flow: what happens when skew is detected? Include the runbook (diagnostic steps, mitigation options, escalation path).
(c) Estimate the engineering effort and add it to the roadmap. Which quarter should it be scheduled in, and which existing items (if any) must it precede?
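For part (a), a minimal PSI implementation. Binning by training-set quantiles is one common convention (Chapter 30's exact binning may differ), and the usual rule of thumb treats PSI below 0.1 as stable and above 0.2 as significant drift:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of one feature: `expected` is the
    training sample, `actual` the serving sample. Bin edges come from
    training quantiles; epsilon guards against log(0)."""
    eps = 1e-6
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    p = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    q = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 50_000)
print(psi(train, rng.normal(0.0, 1.0, 50_000)))  # no skew: PSI near 0
print(psi(train, rng.normal(0.5, 1.0, 50_000)))  # mean shift: clear drift signal
```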
Exercise 36.16 (***)
Perform a build-vs-buy analysis for replacing the custom monitoring stack (Grafana + custom PSI computation + AlertManager) with a commercial ML monitoring platform (e.g., Arize, WhyLabs, or Evidently Cloud).
(a) List the capabilities of the custom stack (from Chapter 30) and map each to the equivalent feature in a commercial platform.
(b) Estimate the Year 1 and Year 3 costs for both options. Include engineering time for building and maintaining the custom stack, and license fees + integration time for the commercial platform.
(c) Identify the switching costs: what would migration require? How long would both systems need to run in parallel?
(d) Write an ADR for your recommendation.
Cross-Domain Transfer
Exercise 36.17 (**)
The capstone architecture is designed for a content recommendation platform. Redesign it for Meridian Financial's credit scoring system, using the same component framework.
(a) Map each StreamRec component to its credit scoring equivalent:
| StreamRec Component | Credit Scoring Equivalent |
|---|---|
| Two-tower retrieval | ??? |
| MLP/transformer ranker | ??? |
| Re-ranker | ??? |
| Feature store | ??? |
| Canary deployment | ??? |
| Creator fairness audit | ??? |
(b) What components does credit scoring require that StreamRec does not? (Hint: think about regulatory requirements from Chapters 28, 29, and 35.)
(c) What is the credit scoring system's "latency budget"? Is it real-time (sub-second) or batch? How does this change the architecture?
Exercise 36.18 (**)
Redesign the capstone architecture for a climate forecasting system that predicts regional temperature and precipitation 7 days ahead.
(a) Which StreamRec components have direct analogs in climate forecasting, and which do not?
(b) The climate system does not have "users" in the traditional sense. What is the equivalent of the "fairness audit" — i.e., for which populations or regions should forecast quality be equitable?
(c) The climate system's evaluation cannot use A/B testing (you cannot show different weather to different people). How do you establish causal impact? What evaluation strategy from Chapters 33-34 is most applicable?
Exercise 36.19 (***)
Design a pharma clinical trial analysis system using the capstone architecture framework. The system must:

1. Ingest trial data from multiple sites (analogous to the feature store)
2. Estimate treatment effects using causal methods (Chapters 16-19)
3. Quantify uncertainty (Chapter 34)
4. Ensure privacy across multi-site data (Chapter 32)
5. Produce regulatory-ready documentation (Chapter 35)
(a) Define the component inventory, mapping each to the relevant chapter.
(b) Write the evaluation strategy. What are the three levels (model, system, business) for a pharma analysis system?
(c) What is the pharma equivalent of the "ADR"? (Hint: think about the regulatory concept of a Statistical Analysis Plan.)
Full Integration
Exercise 36.20 (***)
Implement a system integration test that validates the end-to-end StreamRec pipeline on synthetic data, starting from the data generator below.
```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

import numpy as np


@dataclass
class SyntheticStreamRecData:
    """Generate synthetic data for integration testing.

    Creates users, items, interactions, and features sufficient
    to test the full pipeline end-to-end.

    Attributes:
        n_users: Number of synthetic users.
        n_items: Number of synthetic items.
        interaction_density: Fraction of possible interactions that occur.
        n_creator_groups: Number of distinct creator groups.
        seed: Random seed for reproducibility.
    """

    n_users: int = 1000
    n_items: int = 5000
    interaction_density: float = 0.01
    n_creator_groups: int = 5
    seed: int = 42

    def generate(self) -> Dict:
        """Generate all synthetic data.

        Returns:
            Dictionary with keys: user_features, item_features,
            interactions, creator_groups, ground_truth, content_share.
        """
        rng = np.random.default_rng(self.seed)

        # User features: [age_bucket, region, tenure_days]
        user_features = {
            uid: {
                "age_bucket": int(rng.choice([0, 1, 2, 3])),
                "region": int(rng.choice([0, 1, 2, 3, 4])),
                "tenure_days": int(rng.integers(1, 1825)),
            }
            for uid in range(self.n_users)
        }

        # Item features: [category, creator_group, freshness_days]
        creator_groups = {
            iid: int(rng.integers(0, self.n_creator_groups))
            for iid in range(self.n_items)
        }
        item_features = {
            iid: {
                "category": int(rng.integers(0, 20)),
                "creator_group": creator_groups[iid],
                "freshness_days": int(rng.integers(0, 365)),
            }
            for iid in range(self.n_items)
        }

        # Interactions: sparse user-item pairs
        n_interactions = int(
            self.n_users * self.n_items * self.interaction_density
        )
        user_ids = rng.integers(0, self.n_users, size=n_interactions)
        item_ids = rng.integers(0, self.n_items, size=n_interactions)
        interactions = list(zip(user_ids.tolist(), item_ids.tolist()))

        # Ground truth: held-out interactions for evaluation
        n_test = n_interactions // 5
        test_users = rng.integers(0, self.n_users, size=n_test)
        test_items = rng.integers(0, self.n_items, size=n_test)
        ground_truth: Dict[int, Set[int]] = {}
        for uid, iid in zip(test_users.tolist(), test_items.tolist()):
            ground_truth.setdefault(uid, set()).add(iid)

        # Creator group content shares
        group_counts = np.bincount(
            list(creator_groups.values()),
            minlength=self.n_creator_groups,
        )
        content_share = {
            g: float(count / self.n_items)
            for g, count in enumerate(group_counts)
        }

        return {
            "user_features": user_features,
            "item_features": item_features,
            "interactions": interactions,
            "creator_groups": creator_groups,
            "ground_truth": ground_truth,
            "content_share": content_share,
        }
```
(a) Generate the synthetic data and verify the dimensions.
(b) Build a minimal pipeline: popularity-based retrieval (top-500 most interacted items) -> random re-ranking -> top-10 selection. Compute Hit@10, NDCG@10, and creator exposure equity.
(c) Replace popularity retrieval with embedding-based retrieval (use random embeddings as a placeholder). Does the evaluation framework still work? What are the expected Hit@10 and NDCG@10 for random embeddings?
(d) Add the fairness audit. Compute exposure equity by creator group. Which group is most underserved?
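A compact stand-in for part (b) that skips the full generator above and uses purely random interactions; with random data, the popularity baseline should land near chance level:

```python
import numpy as np

# Sizes mirror the generator's defaults; interactions are uniform random,
# so this only exercises the pipeline plumbing, not model quality.
rng = np.random.default_rng(42)
n_users, n_items = 1000, 5000
train_items = rng.integers(0, n_items, size=50_000)

truth: dict = {}
for u, i in zip(rng.integers(0, n_users, 10_000).tolist(),
                rng.integers(0, n_items, 10_000).tolist()):
    truth.setdefault(u, set()).add(i)

# Popularity retrieval: the 500 most-interacted items, shared by all users.
counts = np.bincount(train_items, minlength=n_items)
top500 = np.argsort(counts)[::-1][:500]

# Random re-rank -> top-10 selection, then Hit@10 over evaluated users.
hits = 0
for u, items in truth.items():
    recs = rng.permutation(top500)[:10]
    hits += int(any(int(r) in items for r in recs))
hit_rate = hits / len(truth)
print(f"Hit@10: {hit_rate:.4f}")
```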
Exercise 36.21 (***)
Design and implement a configuration management system for the StreamRec capstone that tracks all model hyperparameters, feature lists, deployment settings, and evaluation thresholds in a single versioned configuration.
(a) Define a configuration schema that covers: model hyperparameters (retrieval, ranking, re-ranking), feature store settings (batch frequency, streaming window), deployment parameters (canary percentage, rollback threshold), monitoring thresholds (PSI, latency SLO), and fairness thresholds (exposure equity, quality disparity).
(b) Implement the schema as a Python dataclass with validation (e.g., canary percentage must be between 0 and 1, PSI threshold must be positive).
(c) Add a diff() method that compares two configurations and lists all differences — essential for debugging "what changed between the last working version and the current broken version?"
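A sketch of parts (b) and (c), covering only a few illustrative fields; the field names and defaults are assumptions, not the book's canonical schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class StreamRecConfig:
    # Hypothetical configuration fields for illustration.
    canary_fraction: float = 0.05
    psi_threshold: float = 0.2
    p99_latency_slo_ms: float = 200.0
    exposure_equity_min: float = 0.60
    ranker_hidden_dims: tuple = (256, 128)

    def __post_init__(self):
        # Validation runs at construction time, so a bad config
        # fails fast instead of surfacing in production.
        if not 0.0 <= self.canary_fraction <= 1.0:
            raise ValueError("canary_fraction must be in [0, 1]")
        if self.psi_threshold <= 0:
            raise ValueError("psi_threshold must be positive")

    def diff(self, other: "StreamRecConfig") -> dict:
        """Map of field name -> (self value, other value) for every
        field that differs; empty when the configurations match."""
        a, b = asdict(self), asdict(other)
        return {k: (a[k], b[k]) for k in a if a[k] != b[k]}

old = StreamRecConfig()
new = StreamRecConfig(canary_fraction=0.10, psi_threshold=0.25)
print(old.diff(new))
# -> {'canary_fraction': (0.05, 0.1), 'psi_threshold': (0.2, 0.25)}
```

Freezing the dataclass keeps a deployed configuration immutable, which is what makes the "what changed?" diff trustworthy.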
Exercise 36.22 (****)
Design a simulation framework that models the long-term dynamics of the StreamRec recommendation system. The simulation should capture:
- Feedback loops: Recommendations influence user behavior, which influences future training data, which influences future recommendations.
- Creator ecosystem dynamics: Creators who receive less exposure produce less content over time. Creators who receive more exposure attract imitators.
- User preference drift: User interests shift over time, introducing natural concept drift.
- Intervention effects: Model how fairness interventions (exposure equity constraints) interact with feedback loops — does forcing more equitable exposure improve or degrade long-term ecosystem health?
(a) Define the simulation state: user preferences (evolving), creator population (entering/exiting), item catalog (growing/decaying), and recommendation model (retraining periodically).
(b) Implement a single simulation step: generate recommendations, simulate user responses, update state. Run for 100 time steps.
(c) Compare the long-term outcomes (Hit@10, creator diversity, user satisfaction) of three strategies: (i) pure accuracy optimization, (ii) accuracy with exposure equity constraint, (iii) accuracy with Thompson sampling exploration.
(d) This exercise connects to Chaney et al., "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility" (RecSys, 2018). How do your simulation results compare to their findings on algorithmic monoculture?
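A deliberately tiny version of the part (b) loop, with an assumed popularity-bonus dynamic standing in for periodic retraining; all constants are illustrative:

```python
import numpy as np

# Scores are latent item "quality" plus a bonus that grows with past
# exposure, so early winners compound -- the simplest possible
# feedback loop.
rng = np.random.default_rng(0)
n_items, k = 200, 10
quality = rng.uniform(size=n_items)   # latent item quality
exposure = np.zeros(n_items)
total_clicks = 0

for step in range(100):
    scores = quality + 0.05 * np.log1p(exposure)   # feedback term
    recs = np.argsort(scores)[::-1][:k]            # top-k recommendation
    clicks = rng.random(k) < quality[recs]         # simulated user response
    total_clicks += int(clicks.sum())
    exposure[recs] += 1                            # state update

distinct_ever = int((exposure > 0).sum())
print(f"items ever recommended: {distinct_ever} / {n_items}")
# Pure score ranking locks exposure onto the initial top-k forever:
# the homogeneity effect to measure against strategies (ii) and (iii).
```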
Retrospective and Reflection
Exercise 36.23 (*)
Using the CapstoneRetrospective class from Section 36.8, create a retrospective for a hypothetical Track A implementation. Include at least one item in each category (worked_well, didnt_work, would_change, surprised_by, learned). Focus on integration challenges rather than individual component difficulties.
Exercise 36.24 (**)
You completed a Track B implementation in 120 hours. Your time breakdown was:
| Activity | Hours | Percentage |
|---|---|---|
| Model training and tuning | 25 | 21% |
| Feature engineering and store | 20 | 17% |
| Pipeline and deployment | 22 | 18% |
| Integration and debugging | 28 | 23% |
| Evaluation and fairness | 12 | 10% |
| Documentation (TDD, ADRs) | 8 | 7% |
| Stakeholder presentation | 5 | 4% |
(a) The largest category is "Integration and debugging" (23%). This is consistent with the chapter's claim that integration is the hardest part. Identify five specific types of integration bugs that plausibly consumed this time. For each, describe the symptom, the root cause, and the fix.
(b) "Documentation" consumed only 7% of total time. Argue for or against the claim that this should have been higher (at least 10-15%). What documentation debt does under-investment create?
(c) If you had 80 hours instead of 120, which activities would you cut and which would you protect? Justify your answer using the Track A/B/C framework.
Exercise 36.25 (**)
Write a post-launch review for the StreamRec system after 30 days in production. The following events occurred:
- Day 3: PSI alert on the user_session_length feature. Investigation revealed a logging change that altered the session timeout from 30 to 15 minutes. Impact: moderate drift, no model quality degradation detected.
- Day 12: CTR dropped 1.2% (from a 3.4% improvement to a 2.2% improvement). Investigation revealed that a competitor launched a similar feature, reducing StreamRec's relative advantage. Not a system issue.
- Day 19: Canary deployment of the weekly retrained model failed its validation gate: Hit@10 dropped by 0.03 on canary traffic. Automatic rollback triggered. Root cause: a data pipeline bug introduced duplicate interactions, inflating the training set size by 15%.
- Day 27: Creator complaint about low exposure for newly published items. Investigation confirmed the FAISS rebuild delay (TD-001) as the root cause.
(a) Classify each event: system issue, external factor, known debt, or new discovery.
(b) Which events should generate new monitoring alerts? New behavioral tests? New ADRs? New debt items?
(c) Update the technical roadmap based on these 30-day learnings. Do any priorities change?
Exercise 36.26 (***)
Compare the StreamRec capstone project to a real-world recommendation system described in one of the following papers:

- Covington et al., "Deep Neural Networks for YouTube Recommendations" (RecSys, 2016)
- Naumov et al., "Deep Learning Recommendation Model for Personalization and Recommendation Systems" (arXiv, 2019) (the Facebook DLRM paper)
- Zhao et al., "Recommending What Video to Watch Next: A Multitask Ranking System" (RecSys, 2019) (YouTube multi-task ranking)
(a) Map the paper's architecture to the StreamRec component framework. Which components are present? Which are absent or described differently?
(b) Identify three design decisions in the paper that differ from the StreamRec capstone. For each, write a brief ADR explaining the paper's choice and why it may or may not be appropriate for StreamRec.
(c) What evaluation methodology does the paper use? How does it compare to the three-level evaluation framework (model, system, business) from Section 36.5?
Exercise 36.27 (****)
Design a capstone project for a domain not covered in this book. Choose one: autonomous vehicle perception, drug discovery, supply chain optimization, or fraud detection.
(a) Define the three tracks (A, B, C) with component inventories, deliverables, and expected effort.
(b) Identify which chapters from this book apply directly, which apply with modification, and which do not apply. For the gaps, identify what additional technical content would be needed.
(c) Write the TDD template (Section 36.3) adapted for your domain. Which sections remain the same? Which must be modified or replaced?
(d) Design the three-level evaluation framework. What are the model-level, system-level, and business-level metrics for your domain?