Chapter 38: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods
  • Three stars (***): Derive a result, implement from scratch, or design a system component
  • Four stars (****): Research-level problems that connect to open questions in the field


Design Reviews and RFCs

Exercise 38.1 (*)

A mid-level data scientist on the StreamRec team proposes replacing the current two-tower retrieval model (Chapter 13) with a cross-encoder that scores every item in the catalog for every request. The proposal cites a 12% improvement in NDCG@10 on an offline benchmark.

(a) Write three design review questions you would ask before this proposal proceeds to implementation. For each question, explain what risk or assumption it surfaces.

(b) The two-tower model retrieves 500 candidates in 15ms. The cross-encoder scores each item in 0.5ms. Calculate the latency for scoring the full 200,000-item catalog. Is this feasible within StreamRec's 200ms latency budget?
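As a starting point for part (b), the brute-force arithmetic can be set up as below. This sketch assumes strictly sequential scoring, the simplest possible model; real serving would batch requests on accelerators, so treat the result as an upper bound to reason from, not a final answer.

```python
# Back-of-envelope latency for part (b). Assumes sequential scoring of
# one item at a time; batching on GPUs would change these numbers.
CATALOG_SIZE = 200_000        # items in the StreamRec catalog
MS_PER_ITEM = 0.5             # cross-encoder scoring cost per item
LATENCY_BUDGET_MS = 200       # StreamRec's end-to-end latency budget

full_scan_ms = CATALOG_SIZE * MS_PER_ITEM
print(f"Full-catalog scan: {full_scan_ms:,.0f} ms "
      f"({full_scan_ms / 1000:.0f} s) vs. budget {LATENCY_BUDGET_MS} ms")
```

Comparing the full-scan figure against the budget should make the feasibility question in part (b), and the motivation for part (c), concrete.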

(c) Propose a compromise approach that captures some of the cross-encoder's quality improvement while respecting the latency constraint. Reference specific chapters from this textbook.


Exercise 38.2 (*)

Using the DesignDocument dataclass from Section 38.2, create a complete design document for the following proposal: "Add a diversity re-ranker to the StreamRec recommendation pipeline that ensures each page of 10 recommendations contains items from at least 3 different content categories."

Fill in all fields, including at least two alternatives considered (e.g., post-hoc re-ranking vs. diversity-aware loss function during training) and at least two risks (e.g., latency impact, engagement regression).
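If you do not have Section 38.2 in front of you, the sketch below shows the general shape such a dataclass might take. The field names here are illustrative placeholders and may differ from the book's actual definition; substitute the real fields when you complete the exercise.

```python
from dataclasses import dataclass, field

@dataclass
class DesignDocument:
    """Illustrative stand-in for the Section 38.2 dataclass.
    Field names are placeholders, not the book's exact definition."""
    title: str
    author: str
    problem_statement: str
    proposed_approach: str
    alternatives_considered: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)
    success_metrics: list[str] = field(default_factory=list)

# A partially filled example for the diversity re-ranker proposal.
doc = DesignDocument(
    title="Diversity re-ranker for StreamRec recommendations",
    author="<your name>",
    problem_statement="Each page of 10 recommendations should span >= 3 categories.",
    proposed_approach="Post-hoc greedy re-ranking over the top-100 candidates.",
    alternatives_considered=["Diversity-aware loss during training",
                             "Category quotas applied at retrieval time"],
    risks=["Added re-ranking latency", "Short-term engagement regression"],
)
```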


Exercise 38.3 (**)

Write an RFC for the following organizational change: "All A/B tests on the StreamRec recommendation system must report a doubly robust causal ATE estimate (Chapter 18) in addition to the naive treatment-control difference."

Your RFC should include:

(a) Motivation: Why is the naive estimate insufficient? Use a specific example from the textbook (Chapter 15 or Chapter 33) to illustrate the bias.

(b) Proposal: What specific changes are required to the experiment analysis pipeline, the experiment report template, and the team's training?

(c) Impact: Who is affected? What are the costs (implementation effort, analysis complexity, training time)?

(d) Alternatives: At least two alternatives (e.g., require causal estimates only for experiments above a certain sample size; use a simpler method like IPW instead of doubly robust).

(e) Migration plan: How do existing experiments transition to the new standard?


Exercise 38.4 (*)

A junior data scientist submits a design document for review. The document proposes building a custom feature store from scratch using Redis and PostgreSQL, estimating 6 weeks of development time. The document does not include an "Alternatives Considered" section.

(a) Explain why the missing "Alternatives Considered" section is a problem, referencing the ADR discipline from Chapter 36.

(b) Write the "Alternatives Considered" section for this proposal. Include at least three alternatives: the proposed custom build, an open-source solution (Feast), and a managed cloud service. For each, provide the key tradeoff.

(c) Based on the build vs. buy framework from Section 38.6, which option would you recommend for a team of 9 (6 DS, 2 MLE, 1 DE)? Justify your answer using the differentiation test.


Exercise 38.5 (**)

Create a design review checklist specifically for data science projects (as opposed to general software engineering projects). Your checklist should include at least 15 items organized into categories.

Suggested categories:

  • Problem framing (Is this the right problem? Is it well-defined?)
  • Data (Is the data available, clean, and representative?)
  • Methodology (Is the approach appropriate? Are simpler alternatives considered?)
  • Evaluation (How will success be measured? Is the evaluation plan rigorous?)
  • Production readiness (How will this be deployed, monitored, and maintained?)
  • Ethical review (Are there fairness, privacy, or harm considerations?)

For each item, write the checklist question and explain what failure mode it prevents.


Mentoring and Knowledge Sharing

Exercise 38.6 (*)

A junior data scientist (6 months of experience) asks you to mentor them. In your first session, they say: "I want to get better at machine learning." This goal is too vague to be actionable.

(a) Write five specific, measurable sub-goals that could replace this vague goal. Each should be achievable within 3-6 months.

(b) For each sub-goal, describe one concrete activity the mentee could undertake and one artifact they could produce to demonstrate progress.

(c) Design a "calibration exercise" that helps you assess the mentee's current level. The exercise should take 30-60 minutes and cover both technical knowledge and practical skills.


Exercise 38.7 (**)

You are establishing a brown bag series for the StreamRec data science team. Design a 12-week schedule of topics that covers a balanced mix of:

  • Technical depth (specific techniques or tools)
  • Production engineering (deployment, monitoring, testing)
  • Research literacy (paper discussions)
  • Career and professional development
  • Ethics and responsibility

For each session, specify the topic, a suggested presenter (by role: junior DS, senior DS, MLE, etc.), and one discussion question to seed the conversation.


Exercise 38.8 (*)

A senior data scientist on your team has deep expertise in Bayesian methods but has never presented their work to the team or written an internal blog post. They express interest in sharing their knowledge but say they "don't have time."

(a) Propose three progressively lower-effort ways for them to share their expertise, from a full brown bag presentation to something requiring less than 30 minutes of preparation.

(b) Explain why knowledge sharing should be considered part of the job (not extracurricular) and how it could be reflected in performance evaluation criteria.


Stakeholder Management and Organizational Dynamics

Exercise 38.9 (**)

The VP of Product at StreamRec asks: "Why can't we just use ChatGPT to generate recommendations instead of all this infrastructure?" Write a response that:

(a) Acknowledges the underlying question (can LLMs replace traditional recommendation systems?)

(b) Explains the technical limitations without condescension (latency at scale, personalization from interaction data, cost per request)

(c) Identifies where LLMs could add value (catalog understanding, cold-start content embeddings, natural language explanations — referencing Chapter 11)

(d) Proposes a concrete next step (e.g., a pilot that uses LLM-generated content descriptions as features in the existing recommendation pipeline)


Exercise 38.10 (**)

Three stakeholders make competing requests for the data science team's Q2 capacity:

  • Product Manager A wants a real-time content moderation model (estimated 8 weeks of ML engineering)
  • Product Manager B wants improved recommendation quality for the mobile app (estimated 6 weeks)
  • The CFO wants a churn prediction model for the finance team's revenue forecasting (estimated 4 weeks)

The team has 10 weeks of available ML engineering capacity in Q2.

(a) You cannot do all three. Using the prioritization framework from Section 38.5 (alignment with OKRs, expected business impact, technical dependencies), write a 1-page recommendation for which projects to accept, which to defer, and why.

(b) For the deferred project(s), propose an alternative that partially addresses the stakeholder's need with lower effort.

(c) Write the email you would send to the stakeholder whose project is deferred. Follow the "acknowledge, explain, offer alternative, escalate if needed" framework from Section 38.5.


Exercise 38.11 (***)

The head of marketing at Meridian Financial asks the data science team to build a model that predicts which credit card holders are most likely to respond to a balance transfer offer. They want to target the top 10% of predicted responders with a direct mail campaign.

(a) Identify the ethical concerns with this request. Consider: Who are "likely responders"? Are they financially vulnerable? Could targeting them increase debt burden?

(b) Propose an alternative approach that balances business value (targeting likely responders) with customer welfare (avoiding harm to financially vulnerable customers). How would you frame this alternative to the marketing head?

(c) If the marketing head insists on the original approach despite your concerns, describe the escalation path you would follow. Who would you involve? What documentation would you prepare?


Exercise 38.12 (*)

Translate the following technical statement into three versions appropriate for three different audiences:

Technical statement: "Our transformer-based ranking model achieves an NDCG@10 of 0.142 on the held-out evaluation set, a 15.4% relative improvement over the previous MLP ranker. The model uses 12 attention heads with 768-dimensional embeddings and was trained on 90 days of interaction data using the BPR loss with in-batch negatives. p99 serving latency is 42ms on 4 NVIDIA A100 GPUs."

(a) Version for the VP of Product (focus on business impact and user experience)

(b) Version for the VP of Engineering (focus on system requirements and reliability)

(c) Version for the Chief Financial Officer (focus on cost and ROI)


Build vs. Buy and Strategic Decisions

Exercise 38.13 (**)

Using the BuildVsBuyAnalysis dataclass from Section 38.6, evaluate the build vs. buy decision for each of the following capabilities at StreamRec:

(a) A/B testing platform (experiment assignment, metric computation, statistical analysis)

(b) Model monitoring and drift detection

(c) Recommendation model training infrastructure

(d) Content embedding generation (converting video metadata and thumbnails into vector representations)

For each, fill in all six dimensions (differentiation, internal expertise, maintenance burden, vendor risk, time to value, total cost estimate) and justify your recommendation.
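If Section 38.6 is not at hand, the sketch below shows one plausible shape for the dataclass, with the six dimensions listed above as fields. Field names and the worked example are illustrative assumptions, not the book's exact definition.

```python
from dataclasses import dataclass

@dataclass
class BuildVsBuyAnalysis:
    """Illustrative stand-in for the Section 38.6 dataclass.
    The six dimensions match the list above; names are placeholders."""
    capability: str
    differentiation: str        # core to the product, or commodity?
    internal_expertise: str     # do we have the skills to build it?
    maintenance_burden: str     # ongoing cost of owning it
    vendor_risk: str            # lock-in, pricing, vendor viability
    time_to_value: str          # build timeline vs. buy timeline
    total_cost_estimate: str    # rough TCO over a fixed horizon
    recommendation: str = ""

# Example for capability (a); the assessments are hypothetical.
ab_testing = BuildVsBuyAnalysis(
    capability="A/B testing platform",
    differentiation="Low: assignment and stats computation are commodity",
    internal_expertise="Moderate: strong on experimentation, weak on platform eng",
    maintenance_burden="High if built: stats engine, dashboards, on-call",
    vendor_risk="Moderate: metric-definition and data-egress lock-in",
    time_to_value="Buy: weeks; build: multiple quarters",
    total_cost_estimate="Vendor fee vs. roughly 1 FTE-year to build and run",
    recommendation="Lean buy, pending the differentiation test",
)
```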


Exercise 38.14 (***)

StreamRec is evaluating a platform bet: migrating from TensorFlow to PyTorch as the primary deep learning framework. Currently, 70% of models are in TensorFlow and 30% are in PyTorch. The migration would take an estimated 6-9 months of distributed effort across all teams.

(a) Apply the platform bet evaluation criteria from Section 38.6 (community/ecosystem, hiring signal, migration path, organizational fit) to this decision. For each criterion, provide specific evidence.

(b) Identify the risks of migrating and the risks of not migrating.

(c) Propose an incremental migration strategy that reduces risk compared to a "big bang" migration.

(d) Write the RFC for this migration, following the structure from Section 38.3.


Exercise 38.15 (**)

A vendor pitches StreamRec a managed feature store service at $15,000/month. The vendor claims 99.99% uptime, sub-5ms latency, and automatic feature freshness monitoring. StreamRec currently uses a custom feature store built on Redis and Parquet (Chapter 25) maintained by the data engineering team.

(a) List the questions you would ask the vendor before evaluating the proposal. Include questions about data residency, SLA penalties, lock-in risk, and integration with existing pipelines.

(b) Estimate the current cost of the custom feature store. Consider: data engineer time (fraction of FTE), infrastructure cost (Redis cluster, storage), and incident response time.

(c) Should StreamRec migrate? Apply the differentiation test and the TCO comparison. Justify your recommendation.


Roadmap and Career Growth

Exercise 38.16 (***)

Write a 12-month technical roadmap for the Meridian Financial data science team, given the following context:

  • Team: 4 data scientists, 1 ML engineer, 1 model risk analyst
  • Current state: XGBoost credit model in production, manual validation process, quarterly model refresh, SHAP-based adverse action reasons
  • Business priorities: (1) Reduce manual underwriting volume by 15%, (2) Achieve 100% automated adverse action compliance, (3) Reduce model validation cycle from 15 business days to 5

Your roadmap should include: vision, 3-4 key bets, quarterly sequencing, team gap analysis, and at least 2 build vs. buy decisions.


Exercise 38.17 (**)

A senior data scientist asks you: "What do I need to do to get promoted to staff?" They have been at the senior level for 2 years, consistently deliver high-quality models, and receive strong performance reviews.

(a) Using the three criteria from Section 38.9 (judgment, scope, impact), assess what this person likely already demonstrates and what they likely need to develop.

(b) Propose three specific projects or activities (within the context of a team working on a recommendation system) that would help them build the missing criteria.

(c) How would you distinguish between "this person is ready for staff but lacks a visible project" and "this person is a strong senior but not yet operating at staff scope"? What observable behaviors differentiate the two?


Exercise 38.18 (*)

Write a 250-word abstract for a blog post titled "What I Learned Building a Causal Evaluation Framework for Our Recommendation System." The abstract should:

(a) Hook the reader with a surprising finding (e.g., the gap between naive and causal metrics from Chapter 36)

(b) Preview the key lessons (2-3 bullet points)

(c) Indicate the target audience (ML practitioners building recommendation systems)

(d) Be accessible to a reader who has not taken this course


Exercise 38.19 (***)

Design a "staff readiness rubric" that a data science manager could use to evaluate whether an IC is ready for promotion to staff. The rubric should:

(a) Cover at least 8 dimensions (e.g., technical depth, cross-team influence, mentoring, communication, judgment)

(b) For each dimension, define what "meets expectations at senior level," "approaching staff level," and "operates at staff level" looks like, with concrete behavioral examples

(c) Include at least 2 dimensions that are specific to data science (as opposed to general software engineering)


Exercise 38.20 (**)

You join a data science team of 15 as a new staff data scientist. After your first month, you observe the following problems:

  • There is no design review process; projects start with a Slack message and a Jira ticket
  • Three teams are building independent feature stores with incompatible schemas
  • The most experienced data scientist is a bottleneck for all model reviews but has no formal mentoring structure
  • Experiment results are reported inconsistently across teams (some use p-values, some use confidence intervals, some report neither)

(a) Prioritize these four problems. Which do you address first, and why?

(b) For your highest-priority problem, draft a 90-day action plan with specific milestones.

(c) For each problem, identify whether the appropriate intervention is a design review, an RFC, a mentoring initiative, or something else.


Scenario-Based Integration

Exercise 38.21 (***)

The TerraML climate team presents a deep learning model for regional precipitation forecasting. The model achieves state-of-the-art performance on historical data but has the following characteristics:

  • Training requires 8 A100 GPUs for 72 hours
  • Inference latency is 45 seconds per forecast region
  • The model has 1.2 billion parameters
  • Only two people on the team understand the architecture

(a) As the staff data scientist, what concerns would you raise in a design review? Organize your concerns into categories (operational, organizational, scientific, financial).

(b) Apply Theme 6 (Simplest Model That Works): what simpler alternatives should be evaluated before committing to this model? Reference specific chapters.

(c) If the team demonstrates that the complex model significantly outperforms simpler alternatives, what infrastructure and organizational investments are needed to make it production-viable?


Exercise 38.22 (***)

Write a "technical strategy memo" (2-3 pages) addressed to the VP of Engineering at Meridian Financial. The memo should argue for investing in a causal inference capability within the data science team. Structure it as follows:

(a) The business case: What decisions at Meridian would benefit from causal analysis? (Credit limit changes, marketing campaign targeting, fraud intervention effectiveness)

(b) The current gap: What can the team do today (predictive modeling) and what can it not do (causal estimation)? Use a specific example to illustrate the gap.

(c) The investment required: What skills need to be hired or developed? What infrastructure changes are needed? What is the timeline?

(d) The expected return: How would causal capability change decision quality, and what is the estimated business impact?


Exercise 38.23 (**)

A product manager requests a "real-time personalized pricing" model for StreamRec's subscription tiers. The model would predict each user's willingness to pay and display a personalized price.

(a) Identify the ethical, legal, and reputational risks of this request. Consider price discrimination law, user trust, and public perception.

(b) Say no using the framework from Section 38.5. Write the response you would give in the stakeholder meeting.

(c) Propose an alternative that achieves the PM's underlying goal (revenue optimization) without personalized pricing. Consider: segment-based pricing, promotional offers, feature-tier differentiation.


Exercise 38.24 (****)

The staff data scientist role is sometimes criticized as a "glass ceiling" — a role with high expectations, ambiguous authority, and limited career progression beyond it (since principal and distinguished roles are rare). Write a 1,000-word essay evaluating this critique. Address:

(a) The structural tension between influence without authority and accountability for outcomes

(b) Whether the IC track genuinely provides equivalent career progression to the management track, or whether it is a "consolation prize" for people who do not want to manage

(c) What organizational structures or cultural norms make the staff IC role sustainable and fulfilling vs. frustrating and limiting

(d) How the data science staff role differs from the software engineering staff role (if at all) in terms of scope, authority, and organizational positioning


Exercise 38.25 (**)

You are writing the "Team Gaps" section of the StreamRec technical strategy document (Section 38.11). The team has the following composition:

Role          | Count | Strengths                        | Gaps
Senior DS     | 2     | Deep learning, NLP               | No causal inference experience
Mid-level DS  | 3     | General ML, experimentation      | Limited production deployment experience
Junior DS     | 1     | Statistics, Python               | No domain experience
ML Engineer   | 2     | PyTorch serving, Kubernetes      | No experience with fairness tooling
Data Engineer | 1     | Spark, Kafka, feature pipelines  | No ML background

(a) Map each key bet from Section 38.11 to the team members best suited to lead it. Identify which bets have skill gaps that must be addressed.

(b) For each gap, propose a plan: hire, train (internal brown bag, external course, mentoring), or contract (consulting engagement). Justify your choice based on the gap's urgency and the market availability of the skill.

(c) Design a 6-month internal training plan that addresses the two most critical gaps without requiring new hires.


Exercise 38.26 (**)

MediCore Pharma's data science team produces a hierarchical Bayesian model (Chapter 21) estimating treatment effects across 12 hospital sites. The FDA reviewer asks: "Why should I trust a model where the estimates for individual sites are influenced by data from other sites?"

(a) Write a 1-paragraph response to the FDA reviewer that explains partial pooling in non-technical language, without using the terms "prior," "posterior," "hyperparameter," or "shrinkage."

(b) Explain why the regulatory communication challenge is a staff-level skill. What could go wrong if a junior data scientist wrote this response?

(c) Propose a visualization that would accompany the written response and make the concept of partial pooling intuitive. Describe what the axes would show and what the viewer should take away.
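Before attempting parts (a) and (c), it may help to see partial pooling numerically. The sketch below uses the standard precision-weighted shrinkage formula (each site's estimate is pulled toward the cross-site mean in proportion to how noisy that site's own data is); all numbers, including the variance values, are invented for illustration.

```python
# Toy illustration of partial pooling: small-sample sites are pulled
# further toward the cross-site mean than large-sample sites.
# All numbers below are made up for illustration.
site_means = {"A": 0.30, "B": 0.10, "C": 0.22}   # raw per-site effects
site_n     = {"A": 400,  "B": 20,   "C": 150}    # patients per site
sigma2 = 1.0     # assumed within-site variance
tau2   = 0.01    # assumed between-site variance

grand_mean = sum(site_means.values()) / len(site_means)
pooled_est = {}
for site, ybar in site_means.items():
    # Weight on the site's own data grows with its sample size.
    w = (site_n[site] / sigma2) / (site_n[site] / sigma2 + 1 / tau2)
    pooled_est[site] = w * ybar + (1 - w) * grand_mean
    print(f"site {site}: raw={ybar:.3f}  pooled={pooled_est[site]:.3f}  weight={w:.2f}")
```

Site B (20 patients) moves substantially toward the overall mean, while site A (400 patients) barely moves, which is exactly the behavior the FDA reviewer is asking about and a natural basis for the part (c) visualization.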


Exercise 38.27 (**)

You inherit a data science team that has no writing culture. All knowledge lives in individual notebooks, Slack threads, and people's heads. When someone leaves, their knowledge leaves with them.

(a) Design a 90-day plan to establish a writing culture. Include: the platform (wiki, blog, shared drive), the initial content, the incentives (recognition, inclusion in performance reviews), and the cadence.

(b) Write the first post yourself — a 500-word template titled "How We Decided to [Technical Decision]." The template should have sections that any team member can fill in for their own decisions.

(c) How do you handle resistance from team members who say writing is not their job?


Exercise 38.28 (***)

Design a quarterly "state of the platform" review for the StreamRec recommendation system. This is a recurring meeting where the staff data scientist presents the system's current health, recent wins, emerging risks, and upcoming priorities to a mixed audience of data scientists, ML engineers, product managers, and engineering leadership.

(a) Define the agenda (30-60 minutes). What sections does the review cover?

(b) For each section, specify the key metric or artifact that would be presented.

(c) Define "red/yellow/green" criteria for each metric that trigger different levels of action.

(d) Write a sample one-slide summary for Q1 that reports on the cold-start model deployment (Bet 1 from Section 38.11).


Exercise 38.29 (****)

The concept of "technical debt" (Sculley et al., 2015) is well-established in ML systems. Propose an analogous concept of "organizational debt" in data science teams — the accumulated cost of shortcuts in processes, documentation, knowledge sharing, and team development.

(a) Define 5-7 types of organizational debt with concrete examples from a data science team.

(b) For each type, describe how it compounds over time (the "interest" on the debt).

(c) Propose a scoring rubric (analogous to the ML Test Score from Breck et al., 2017) that quantifies a team's organizational debt.

(d) Apply your rubric to the StreamRec team as described in this chapter. What is their score, and what should they address first?
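One way to start part (c) is to mirror the ML Test Score's checklist-with-points structure. The skeleton below is a hypothetical scaffold: the categories, items, and equal weighting are placeholders for you to replace with your own rubric from parts (a) and (b).

```python
# Skeleton scoring rubric for organizational debt, loosely patterned on
# the ML Test Score's checklist structure. All items are placeholders.
RUBRIC = {
    "design reviews":     ["Reviews required before implementation",
                           "Review feedback is tracked to resolution"],
    "documentation":      ["Decisions recorded as ADRs",
                           "Onboarding doc exists and is < 6 months stale"],
    "knowledge sharing":  ["Recurring brown bag or equivalent",
                           "No single-person knowledge silos for prod models"],
    "experiment hygiene": ["Standard results template in use",
                           "Uncertainty reported on every headline metric"],
}

def score(answers: dict[str, list[bool]]) -> float:
    """Category score = fraction of checks passed; overall = mean of categories."""
    per_cat = [sum(v) / len(v) for v in answers.values()]
    return sum(per_cat) / len(per_cat)

# Example assessment: the team runs brown bags but passes nothing else.
example = {cat: [False] * len(items) for cat, items in RUBRIC.items()}
example["knowledge sharing"][0] = True
print(f"organizational health score: {score(example):.3f} (1.0 = no debt)")
```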


Exercise 38.30 (**)

Reflect on your own career trajectory (or your intended trajectory). Based on the IC vs. management distinction in Section 38.1:

(a) Which track appeals to you more, and why? Be specific about which activities (design reviews, mentoring, hiring, performance management, architecture) you find energizing vs. draining.

(b) Identify one dimension from each track that you would want to develop regardless of which track you choose. (For example, a committed IC might still benefit from understanding the hiring process; a committed manager might still benefit from maintaining design review skills.)

(c) Write your own "staff readiness" self-assessment using the three criteria from Section 38.9 (judgment, scope, impact). Where are you strongest? Where do you most need to grow?