Chapter 39: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods
  • Three stars (***): Derive a result, implement from scratch, or design a system component
  • Four stars (****): Research-level problems that connect to open questions in the field


Team Structure

Exercise 39.1 (*)

StreamRec currently has 3 data scientists and is planning to grow to 12 within the next year. Using the OrgContext class from Section 39.2.4, answer the following:

(a) At what team size should StreamRec transition from a centralized model to a hub-and-spoke model? Vary team_size from 3 to 30 while keeping other parameters fixed at the StreamRec scaling values (regulatory_intensity=2, domain_complexity=5, infra_maturity=7, num_stakeholder_groups=6). At what threshold does the recommendation change?

(b) Now vary infra_maturity from 1 to 10 for a 15-person team. How does infrastructure maturity affect the structural recommendation? Explain why this makes sense.

(c) Identify one limitation of the heuristic in recommend_structure(). Propose an additional factor that should be included in the decision and explain how it would modify the logic.
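Exercise 39.1(a) is a parameter sweep. The sketch below shows the shape of the harness; the `OrgContext` dataclass and `recommend_structure` heuristic here are hypothetical stand-ins so the code runs on its own — substitute the actual definitions from Section 39.2.4, whose thresholds will differ.

```python
from dataclasses import dataclass

@dataclass
class OrgContext:  # hypothetical stand-in for the Section 39.2.4 class
    team_size: int
    regulatory_intensity: int
    domain_complexity: int
    infra_maturity: int
    num_stakeholder_groups: int

def recommend_structure(ctx: OrgContext) -> str:
    """Illustrative heuristic only: small teams stay centralized; larger
    teams with mature infrastructure move to hub-and-spoke."""
    if ctx.team_size <= 8:
        return "centralized"
    if ctx.infra_maturity >= 5:
        return "hub-and-spoke"
    return "embedded"

# Sweep team_size from 3 to 30 with the StreamRec scaling values fixed,
# reporting only the points where the recommendation changes.
prev = None
for n in range(3, 31):
    ctx = OrgContext(team_size=n, regulatory_intensity=2, domain_complexity=5,
                     infra_maturity=7, num_stakeholder_groups=6)
    rec = recommend_structure(ctx)
    if rec != prev:
        print(f"team_size={n}: {rec}")
        prev = rec
```

The same loop, varying `infra_maturity` instead of `team_size`, answers part (b).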


Exercise 39.2 (*)

Map the team structures of the four anchor examples to the centralized/embedded/hub-and-spoke taxonomy:

(a) MediCore Pharmaceuticals has 12 data scientists: 4 in the biostatistics group, 3 in pharmacovigilance, 2 in commercial analytics, 2 in manufacturing quality, and 1 DS director. Which structure does this most closely resemble? What are the risks?

(b) Meridian Financial has 22 data scientists: a 6-person central model risk management team and 16 data scientists embedded across credit risk, fraud detection, marketing analytics, and operations research. Classify this structure. What is the central team's primary function — is it a "hub" in the hub-and-spoke sense?

(c) The Pacific Climate Research Consortium has 8 data scientists across 3 universities and 2 agencies. No one institution employs more than 2. What structural challenges does this create that a single-organization team of 8 would not face? Propose a coordination mechanism.


Exercise 39.3 (**)

Design a team structure for each of the following organizations. For each, specify: structure type, hub vs. spoke composition, reporting lines, and the primary risk the structure is designed to mitigate.

(a) A 200-person e-commerce company with 5 data scientists, no ML in production, and a CEO who read an article about AI and wants results in 6 months.

(b) A 50,000-person insurance company with 80 data scientists across actuarial, claims, underwriting, and customer analytics. The actuarial team has used models for decades; the customer analytics team was formed last year. Regulatory requirements mandate independent model validation.

(c) A 15-person climate tech startup building a carbon credit verification platform. The team includes 4 ML engineers, 2 climate scientists, and 1 product manager. Funding runs out in 18 months.


Exercise 39.4 (**)

StreamRec is transitioning from centralized to hub-and-spoke. The current 12-person centralized team will be split into a 4-person hub and 8 embedded data scientists across 4 product teams.

(a) Design the transition plan. What happens in month 1? Month 2? Month 3? What are the riskiest moments?

(b) Two of the strongest data scientists have expressed concern about losing their community and are considering leaving. What concrete steps can you take to retain them while still executing the restructuring?

(c) The VP of Search insists that the embedded DS should report directly to them with no dotted-line relationship to the hub. The VP of Recommendations is fine with the dotted-line model. How do you handle the inconsistency? What are the long-term consequences of each approach?


Hiring

Exercise 39.5 (*)

A candidate for a senior data scientist position at StreamRec submits a take-home assignment. Evaluate it using the TakeHomeRubric from Section 39.3.2:

  • Problem framing: The candidate correctly identified the problem as a ranking task but did not define success metrics before building models. Score: 3.
  • Methodology: Used XGBoost with minimal feature engineering. Did not establish a baseline or compare alternatives. Score: 2.
  • Code quality: Clean, well-structured code with type hints and docstrings. Would be deployable with minor modifications. Score: 5.
  • Communication: The written summary is clear but focuses on model architecture rather than business implications. Score: 3.
  • Rigor: No train/test split validation. Did not address data leakage, class imbalance, or feature importance analysis. Score: 2.

(a) Compute the weighted score. Does the candidate pass the 3.5 threshold?

(b) If you could advance this candidate to the on-site despite a below-threshold score, what specific areas would you probe in the technical deep dive?

(c) A second candidate scores: problem framing 5, methodology 3, code quality 2, communication 4, rigor 4. Compute the weighted score. Which candidate would you advance, and why?
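The weighted score in Exercise 39.5 is a dot product of the category scores and the rubric weights. A minimal sketch, using hypothetical equal weights — substitute the actual `TakeHomeRubric` weights from Section 39.3.2, which will shift both totals:

```python
# Illustrative equal weighting; the real rubric almost certainly weights
# categories unequally (e.g., rigor and methodology above code quality).
WEIGHTS = {
    "problem_framing": 0.2, "methodology": 0.2, "code_quality": 0.2,
    "communication": 0.2, "rigor": 0.2,
}

def weighted_score(scores: dict[str, float]) -> float:
    return sum(WEIGHTS[k] * v for k, v in scores.items())

candidate_a = {"problem_framing": 3, "methodology": 2, "code_quality": 5,
               "communication": 3, "rigor": 2}
candidate_b = {"problem_framing": 5, "methodology": 3, "code_quality": 2,
               "communication": 4, "rigor": 4}
print(weighted_score(candidate_a))  # 3.0 under equal weights: below the 3.5 bar
print(weighted_score(candidate_b))  # 3.6 under equal weights: above the bar
```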


Exercise 39.6 (**)

Design a complete hiring process (5 stages as described in Section 39.3.2) for each of the following roles at StreamRec:

(a) A junior ML engineer (0-2 years of experience) who will build and maintain the feature store and serving infrastructure.

(b) A senior applied scientist (5+ years) who will lead the causal inference practice and design the experimentation strategy.

(c) A responsible AI lead who will build the fairness auditing framework and establish the pre-deployment review process.

For each role, specify: what the take-home assignment looks like, what the on-site sessions assess, and what the most important differentiating signal is.


Exercise 39.7 (**)

The following questions appeared in data science interviews at different companies. Classify each as testing a production-relevant skill or an interview artifact (a skill that helps in interviews but does not predict on-the-job performance). Justify each classification.

(a) "Implement a binary search tree and find the k-th smallest element."

(b) "A product manager tells you that your A/B test showed a 2% lift in CTR but the sample size was only 500 users. They want to ship the feature. What do you do?"

(c) "Explain the difference between a GAN and a VAE."

(d) "Here is a dataset of customer transactions. In the next 30 minutes, explore the data and tell me three things that would be useful for the marketing team to know."

(e) "Design a system to detect fraudulent transactions in real-time. Walk me through the architecture, the features you would use, how you would evaluate the model, and how you would monitor it in production."

(f) "Implement quicksort from memory."


Exercise 39.8 (***)

Your take-home assignment for MediCore Pharmaceuticals needs to assess a candidate's ability to do causal inference in a regulatory context. Design the take-home:

(a) Create a synthetic dataset (describe the data-generating process) that includes a treatment variable, a confounded outcome, and two covariates — one of which is a valid adjustment variable and one of which is a collider. The dataset should have ~5,000 rows.

(b) Write the instructions given to the candidate. The instructions should ask the candidate to estimate the treatment effect, but should NOT tell them which variables to adjust for. A strong candidate will draw a causal DAG and identify the collider; a weak candidate will adjust for everything.

(c) Design the rubric. What are the pass/fail signals? What distinguishes a "good" answer from an "exceptional" one?
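One possible data-generating process for Exercise 39.8(a), sketched in pure Python. The variable names (`x` for the valid adjustment variable, `s` for the collider) and coefficients are illustrative choices, not a prescribed answer; the key structural features are that `x` causes both treatment and outcome, while `s` is caused by both.

```python
import random

random.seed(0)
TRUE_EFFECT = 1.5  # the effect a correctly specified analysis should recover
rows = []
for _ in range(5000):
    x = random.gauss(0, 1)                              # confounder: valid adjustment variable
    t = 1 if (1.0 * x + random.gauss(0, 1)) > 0 else 0  # treatment depends on x
    y = TRUE_EFFECT * t + 1.2 * x + random.gauss(0, 1)  # outcome depends on t and x
    s = 1.0 * t + 1.0 * y + random.gauss(0, 1)          # collider: child of t and y
    rows.append((t, y, x, s))

# The naive difference in means is biased upward because x confounds t and y.
treated = [y for t, y, _, _ in rows if t == 1]
control = [y for t, y, _, _ in rows if t == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)
print(f"true effect: {TRUE_EFFECT}, naive estimate: {naive:.2f}")
```

A strong candidate adjusts for `x` only; adjusting for `s` opens a biasing path between `t` and `y` even though `s` correlates with both.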


Culture

Exercise 39.9 (*)

Classify each of the following organizational behaviors according to the experimentation maturity scale (Level 0-5) from Section 39.4.1:

(a) "We track daily active users, but product decisions are made by the VP based on customer feedback calls."

(b) "Every feature launch requires an A/B test. Last quarter, we killed two features that tested negative, even though they were championed by senior leaders."

(c) "We run A/B tests when the product team is unsure. But if the designer is confident in the new design, we skip the test."

(d) "Our last experiment showed no effect on our primary metric, but we discovered that it increased engagement among new users while decreasing it among power users. We are now investigating why, and the insight is reshaping our user segmentation strategy."

(e) "We have a dashboard that shows A/B test results. Most teams check it after launch to confirm that things look okay."


Exercise 39.10 (**)

StreamRec's DS team has identified three rigor gaps in its current practice. For each, design a specific organizational process (not a technology) that addresses the gap:

(a) Gap: One-off analyses are shared as Jupyter notebooks via Slack. There is no review process. A recent analysis that informed a $2M marketing campaign contained a data leakage bug that inflated the estimated effect by 3x.

(b) Gap: A/B test results are analyzed post-hoc — the primary metric, success criterion, and statistical test are chosen after the test concludes. The team suspects that at least two "successful" features were approved based on cherry-picked metrics.

(c) Gap: A model deployed 8 months ago has never been re-evaluated. The team member who built it has since left. No one knows if it still performs acceptably, and the monitoring dashboard was never configured.


Exercise 39.11 (**)

Design an ethics review process for StreamRec's recommendation system. The process should be specific, operational, and integrated into the existing development workflow:

(a) What triggers an ethics review? (Every model change? Only major changes? Changes that affect specific populations?)

(b) Who conducts the review? (The model owner? A separate responsible AI team? A cross-functional committee?)

(c) What does the review evaluate? (Define a checklist with specific items and pass/fail criteria.)

(d) What authority does the review have? (Advisory only? Can it block deployment? Can it be overridden, and by whom?)

(e) How does the review interact with the deployment pipeline from Chapter 29? (Does it come before canary? During canary? After full rollout?)


Exercise 39.12 (***)

MediCore Pharmaceuticals operates in a regulatory environment where the FDA may audit any analysis at any time. Design a "rigor infrastructure" that makes regulatory-grade reproducibility the path of least resistance for MediCore's data scientists:

(a) What technical systems are required? (Version control, environment management, data versioning, audit logging, etc.)

(b) What organizational processes are required? (Code review, analysis pre-registration, approval workflows, etc.)

(c) How do you prevent this infrastructure from becoming so burdensome that data scientists route around it? (The "compliance theater" failure mode.)


Scaling Impact

Exercise 39.13 (*)

Classify each of the following StreamRec activities as a "project" or a "capability" (Sections 39.5.1-39.5.2):

(a) Building a churn prediction model for the product team.

(b) Building a model registry that stores, versions, and serves any model in the organization.

(c) Analyzing the impact of a recent pricing change on subscriber retention.

(d) Building a fairness auditing framework that runs automatically on every model before deployment.

(e) Building a custom recommendation model for a new content category (podcasts).

(f) Building a self-service experimentation platform that allows product managers to configure and launch A/B tests without DS involvement.


Exercise 39.14 (**)

StreamRec's VP of Product has proposed five projects for Q3. Use the DSProject class and prioritize_portfolio() function from Section 39.5.3 to prioritize them within a $600,000 quarterly budget:

| Project | Impact ($) | P(success) | Cost ($) | Duration (mo) | Strategic alignment | Capability? |
|---------|-----------|------------|----------|---------------|---------------------|-------------|
| Podcast recommendations | 900,000 | 0.65 | 250,000 | 4 | 0.8 | No |
| A/B testing self-service | 500,000 | 0.85 | 200,000 | 3 | 0.9 | Yes |
| Creator analytics dashboard | 300,000 | 0.9 | 100,000 | 2 | 0.6 | Yes |
| Video transcript search | 1,200,000 | 0.4 | 350,000 | 6 | 0.7 | No |
| Data quality monitoring | 400,000 | 0.95 | 150,000 | 3 | 0.8 | Yes |

(a) Compute EVI, ROI, and priority score for each project.

(b) Run the portfolio selection. Which projects are selected? What is the total expected value?

(c) The VP of Product is disappointed that the podcast recommendation project was not selected (assuming it was not). Prepare a 2-paragraph explanation that uses the framework's logic to explain the decision without dismissing the VP's priorities.
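A scaffold for Exercise 39.14(a)-(b). The scoring formulas below — EVI as impact times success probability, ROI as net EVI over cost, priority as ROI scaled by strategic alignment — are illustrative stand-ins for the Section 39.5.3 definitions; plug in the real `DSProject` and `prioritize_portfolio` before relying on the numbers.

```python
from dataclasses import dataclass

@dataclass
class DSProject:  # hypothetical stand-in for the Section 39.5.3 class
    name: str
    impact: float                # value if successful ($)
    p_success: float
    cost: float                  # quarterly cost ($)
    duration_months: int
    strategic_alignment: float   # 0-1
    is_capability: bool          # kept for the part (c) discussion

    @property
    def evi(self) -> float:      # expected value of investment
        return self.impact * self.p_success

    @property
    def roi(self) -> float:
        return (self.evi - self.cost) / self.cost

    @property
    def priority(self) -> float: # illustrative: ROI scaled by alignment
        return self.roi * self.strategic_alignment

def prioritize_portfolio(projects, budget):
    """Greedy: sort by priority, pick each project that still fits."""
    selected, remaining = [], budget
    for p in sorted(projects, key=lambda p: p.priority, reverse=True):
        if p.cost <= remaining:
            selected.append(p)
            remaining -= p.cost
    return selected

projects = [
    DSProject("Podcast recommendations", 900_000, 0.65, 250_000, 4, 0.8, False),
    DSProject("A/B testing self-service", 500_000, 0.85, 200_000, 3, 0.9, True),
    DSProject("Creator analytics dashboard", 300_000, 0.90, 100_000, 2, 0.6, True),
    DSProject("Video transcript search", 1_200_000, 0.40, 350_000, 6, 0.7, False),
    DSProject("Data quality monitoring", 400_000, 0.95, 150_000, 3, 0.8, True),
]
for p in prioritize_portfolio(projects, budget=600_000):
    print(p.name, round(p.evi), round(p.priority, 2))
```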


Exercise 39.15 (***)

The prioritize_portfolio() function uses a greedy algorithm (sort by priority, pick until budget exhausted). This is not optimal — it can miss portfolios that pack the budget more efficiently.

(a) Show that portfolio selection is a variant of the 0/1 knapsack problem.

(b) Implement an exact solution using dynamic programming (with discretized costs) and compare the result to the greedy solution on the Exercise 39.14 data.

(c) Under what conditions does the greedy solution match the optimal solution? When does it diverge significantly?
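A generic 0/1 knapsack DP for Exercise 39.15(b), with costs discretized to $50k units. The item values here are EVIs computed as impact times P(success) from the Exercise 39.14 table; adapt the value function to whatever `prioritize_portfolio` actually maximizes.

```python
def knapsack(items, budget, unit=50_000):
    """items: list of (name, value, cost). Returns (best_value, chosen_names)."""
    cap = budget // unit
    # dp[c] = (best value achievable with capacity c, names chosen)
    dp = [(0.0, [])] * (cap + 1)
    for name, value, cost in items:
        w = -(-cost // unit)  # ceiling division so we never overrun the budget
        for c in range(cap, w - 1, -1):  # iterate downward: each item used at most once
            cand = dp[c - w][0] + value
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - w][1] + [name])
    return dp[cap]

# EVI = impact * P(success), from the Exercise 39.14 table.
items = [
    ("Podcast", 585_000, 250_000),
    ("A/B self-service", 425_000, 200_000),
    ("Creator dashboard", 270_000, 100_000),
    ("Transcript search", 480_000, 350_000),
    ("Data quality", 380_000, 150_000),
]
best, chosen = knapsack(items, budget=600_000)
print(best, chosen)  # the exact optimum packs the $600k budget fully
```

Comparing this optimum against the greedy selection from Exercise 39.14 answers part (c): greedy can strand budget that an exact packing would use.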


Exercise 39.16 (**)

Assess the MLOps maturity level (0-3, from Chapter 29 and Section 39.5.4) of each anchor example:

(a) StreamRec at the end of the progressive project (Chapter 36, Track B).

(b) MediCore Pharmaceuticals, which runs batch causal analyses monthly with Dagster pipelines and manual review gates.

(c) Meridian Financial, which has automated retraining, shadow scoring, and quarterly regulatory validation.

(d) The Pacific Climate Research Consortium, which trains models in Jupyter notebooks on a shared GPU server and delivers results as PDF reports.

For each, identify the single highest-leverage investment that would advance the organization to the next maturity level.


Measuring and Communicating Value

Exercise 39.17 (*)

StreamRec's recommendation system generates the following measurable outcomes in Q1:

  • A/B test shows a 0.3-minute increase in daily engagement per user (causal ATE, p < 0.01)
  • 50 million monthly active users
  • Average revenue per engagement-minute: $0.008
  • DS team Q1 cost: $750,000 (salaries, compute, tools)

(a) Compute the annualized causal revenue contribution of the recommendation system.

(b) Compute the quarterly and annual ROI of the DS team (using only this model's contribution).

(c) Why is this ROI estimate a lower bound? What other value does the DS team create that is not captured in this calculation?
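One way to set up the arithmetic for Exercise 39.17(a)-(b), using only the stated inputs. Note the simplification: it treats every monthly active user as active daily, which part (c) invites you to question.

```python
ate_minutes = 0.3        # causal lift in engagement, minutes per user per day
mau = 50_000_000         # monthly active users (treated here as daily actives)
rev_per_minute = 0.008   # $ per engagement-minute
q_cost = 750_000         # DS team quarterly cost

daily_revenue = ate_minutes * mau * rev_per_minute  # $ per day
annual_revenue = daily_revenue * 365
q_roi = (annual_revenue / 4 - q_cost) / q_cost
annual_roi = (annual_revenue - 4 * q_cost) / (4 * q_cost)
print(f"annual causal revenue: ${annual_revenue:,.0f}")
print(f"quarterly ROI: {q_roi:.1f}x, annual ROI: {annual_roi:.1f}x")
```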


Exercise 39.18 (**)

Design the monthly DS value dashboard (Section 39.6.2) for each anchor example:

(a) StreamRec: What are the four numbers? What time period does each cover?

(b) Meridian Financial: How do you measure "models in production" when the credit scoring model is one model serving all applications? What replaces "experiment win rate" in a setting where A/B testing is not feasible?

(c) MediCore: What replaces "cumulative attributed revenue" when the output is causal analyses for regulatory submissions?

(d) Pacific Climate Consortium: What replaces "team health" when the team spans 5 institutions with different HR systems?


Exercise 39.19 (***)

The CEO of StreamRec asks: "What would happen if we disbanded the data science team and used the $3M annual budget for something else?" This is the ultimate ROI question — the counterfactual cost of eliminating the function.

(a) List all the systems and capabilities that would degrade or fail within 30 days, 90 days, and 12 months.

(b) For each, estimate the business impact (in dollars, user metrics, or risk) of the degradation.

(c) Why is this thought experiment more useful for demonstrating DS value than the standard ROI calculation? What are its limitations?


Exercise 39.20 (**)

Prepare a three-slide executive briefing for Meridian Financial's CTO. The DS team has:

  • Deployed a new credit scoring model that reduces false negative rate (missed defaults) by 12% while maintaining the same approval rate
  • Completed a fairness audit showing that the adverse impact ratio improved from 0.72 to 0.81 (four-fifths rule threshold: 0.80)
  • Identified a data quality issue in the address verification feature that, if unaddressed, could cause regulatory findings

Design each slide (title, 3-4 bullet points, one key number per slide). Remember: the CTO's decisions are about risk, compliance, and cost — not model architecture.


Organizational Design Synthesis

Exercise 39.21 (***)

A healthcare company (5,000 employees, $800M revenue) has just hired its first VP of Data Science. Currently, the company has 8 data scientists scattered across clinical operations, supply chain, marketing, and IT — each reporting to different VPs. None of them use the same tools, coding standards, or data infrastructure. Two models are "in production" (a demand forecasting model run manually in Excel and a patient readmission model that was deployed 2 years ago and has not been monitored).

(a) Diagnose the current state using the frameworks from this chapter (team structure, MLOps maturity, experimentation maturity, organizational capability vs. project orientation).

(b) Design a 12-month transformation plan. What happens in months 1-3? 4-6? 7-12?

(c) What are the top 3 risks to the plan? For each, describe a concrete mitigation strategy.

(d) The CFO is skeptical of the DS investment and has given the VP 6 months to demonstrate value. What "quick wins" should the VP prioritize to build credibility while simultaneously investing in longer-term capability?


Exercise 39.22 (***)

Compare the build-vs-buy decision across the four anchor examples for each of the following components:

| Component | StreamRec | MediCore | Meridian | Climate |
|-----------|-----------|----------|----------|---------|
| Feature store | | | | |
| Experiment platform | | | | |
| Model monitoring | | | | |
| Model serving | | | | |
| Fairness auditing | | | | |

For each cell, recommend "build," "buy," or "open-source + customize," and justify based on the organization's specific context (regulatory requirements, team size, domain specificity, differentiation value).


Exercise 39.23 (***)

Design a data science career framework for StreamRec with 5 levels:

| Level | Title | Scope | Key Differentiator |
|-------|-------|-------|--------------------|
| L3 | Data Scientist | Individual project | |
| L4 | Senior Data Scientist | Multiple projects, mentoring | |
| L5 | Staff Data Scientist | Team/org technical direction | |
| L6 | Principal Data Scientist | Company-wide technical strategy | |
| L7 | Distinguished Data Scientist | Industry-level impact | |

(a) For each level, define: technical expectations, scope of impact, communication expectations, and leadership expectations.

(b) What is the most common failure mode at each level transition? (e.g., L3 to L4: "Continues to work on individual projects instead of mentoring others and multiplying impact.")

(c) How does this framework differ for MediCore (where regulatory expertise is essential) vs. the Pacific Climate Consortium (where publication record matters)?


Exercise 39.24 (****)

The ideal size of an effective team is often cited as 5-9 people (following Miller's 7 +/- 2 rule or Bezos's two-pizza rule), well below Dunbar's number of roughly 150 stable relationships. As a DS organization grows from 5 to 50 to 500 people, the number of pairwise communication channels grows quadratically ($n(n-1)/2$).

(a) Model the communication overhead as a function of team size. If each communication channel requires $c$ hours per week of synchronization, at what team size does communication overhead consume more time than productive work?

(b) Show how team structure (centralized, embedded, hub-and-spoke) reduces the effective number of communication channels. Compute the channel count for a 30-person team under each structure, assuming: centralized has one team meeting, embedded has 6 pod meetings of 5, and hub-and-spoke has 6 pod meetings of 4 plus one hub meeting of 10 (4 hub + 6 spoke leads).

(c) The "Ringelmann effect" (social loafing) suggests that individual productivity decreases as team size increases. If productivity per person scales as $1/\log(n)$ for a team of size $n$, gross output $n/\log(n)$ grows without bound, so it never favors a small team on its own. Combine it with the communication overhead from part (a): for a given per-channel cost $c$ (expressed in the same productivity units), what team size maximizes net output $n/\log(n) - c\,n(n-1)/2$? How does the answer relate to the pod sizes in the hub-and-spoke model?
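The channel counts in Exercise 39.24(b) follow from applying $n(n-1)/2$ piecewise to each meeting group, under the exercise's assumption that people communicate only within their groups. A quick check:

```python
from math import comb

def channels(n: int) -> int:
    """Pairwise communication channels in a fully connected group of n."""
    return comb(n, 2)

centralized = channels(30)                      # one 30-person meeting
embedded = 6 * channels(5)                      # six pods of 5, no cross-pod links
hub_and_spoke = 6 * channels(4) + channels(10)  # six pods of 4 + one hub meeting of 10
print(centralized, embedded, hub_and_spoke)     # 435 60 81
```

Hub-and-spoke pays a modest premium over pure embedding (81 vs. 60 channels) in exchange for the cross-pod coordination that embedding lacks; both are far below the centralized count of 435.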


Exercise 39.25 (****)

StreamRec is considering hiring a Head of Responsible AI — a dedicated role for fairness, privacy, and ethical AI practice. Some team members argue that responsible AI should be everyone's responsibility, not one person's job.

(a) Present the case for a dedicated Head of Responsible AI. What fails when responsible AI is "everyone's job"?

(b) Present the case against the role. What fails when responsible AI is one person's job?

(c) Design a hybrid model: a dedicated responsible AI lead who sets standards and builds tools, combined with embedded "responsible AI champions" in each product team who conduct the day-to-day reviews. Define the responsibilities of each role and the escalation process for disagreements.

(d) How does this hybrid model relate to the hub-and-spoke team structure? Draw the analogy explicitly.


Exercise 39.26 (****)

The Pacific Climate Research Consortium faces a unique organizational challenge: the academic incentive structure (publish papers, win grants) conflicts with the policy impact mandate (deliver actionable projections to decision-makers).

(a) A postdoc has developed a novel uncertainty quantification method that improves calibration by 15% over the current approach. They want to hold the results until the paper is published (estimated 6 months). The policy team needs the improved projections for a legislative briefing in 8 weeks. How do you resolve this?

(b) Design an incentive structure that aligns academic and policy objectives. What does "credit" look like in this structure?

(c) More broadly, how do you measure the "business value" of data science when the output is public-good policy recommendations rather than private-sector revenue? Propose three metrics.


Exercise 39.27 (***)

You are the newly hired VP of Data Science at a fintech company similar to Meridian Financial. In your first month, you discover:

  • The credit scoring model was last validated 18 months ago (regulatory requirement: annually)
  • The model monitoring dashboard has been broken for 3 months and no one noticed
  • The data science team has 0% voluntary attrition because team members are comfortable but unchallenged
  • The fraud detection model was deployed by a contractor who has since left; no one on the current team understands its architecture

(a) Triage these issues: which must be addressed in the first week? First month? First quarter?

(b) For the regulatory validation gap, draft a 1-paragraph communication to the Chief Risk Officer explaining the situation and your remediation plan.

(c) The 0% attrition is presented by the HR team as a positive metric. Explain why it might actually indicate an organizational problem, and what you would investigate.


Exercise 39.28 (***)

Design a quarterly hackathon format for StreamRec's 30-person DS organization. The hackathon should:

(a) Encourage cross-team collaboration (embedded data scientists working with people from other spokes)

(b) Produce artifacts that have a realistic chance of becoming production features (not just demos)

(c) Include a responsible AI component (every hackathon project must include a fairness or privacy analysis)

(d) Be completable in 2.5 days (not the unrealistic "48-hour hackathon" that produces burnout and throwaway code)

Define: team formation process, project scoping guidelines, deliverable requirements, judging criteria, and the process for transitioning winning projects into the regular sprint cycle.