Case Study 1: StreamRec — Scaling Data Science from Startup to Mature Organization
Context
StreamRec, the content streaming platform that has served as the progressive project throughout this book, is growing. The platform has 50 million monthly active users, 200,000 items, $400M in annual revenue, and a recommendation system that the three-person data science team built from scratch across 35 chapters. The CEO has secured Series C funding and committed to scaling the engineering organization from 50 to 500 people over the next 18 months. The data science function, currently 3 people reporting to the VP of Engineering, must grow to 30.
The CEO has hired a VP of Data Science — the book's reader, in the progressive project framing — to lead this transformation. This case study traces the first 18 months, organized by the decisions that mattered most and the mistakes that were most instructive.
Phase 1: Months 1-3 — Assessment and Quick Wins
The Initial Assessment
The new VP's first action was a listening tour: 30-minute conversations with each of the 3 existing data scientists, 6 product leads, the VP of Engineering, the VP of Product, the CFO, and the CEO. The findings:
| Stakeholder | Perception of DS | Primary Need |
|---|---|---|
| Recommendations PM | "Our three data scientists are heroes. They built the entire recommendation system. But they are overwhelmed — I submit requests and wait 3-4 weeks." | Faster turnaround |
| Search PM | "I don't have a data scientist. I've been using heuristics for ranking because I can't get time from the centralized team." | Any DS resource |
| Ads PM | "I hired a contractor to build an ad targeting model. It's running in production. No one on the DS team knows about it." | Legitimize the ad model |
| VP of Engineering | "The recommendation system is a black box. When it breaks at 2am, my on-call engineers can't debug it because they don't understand the ML components." | Operability, documentation |
| VP of Product | "I want every product decision backed by an A/B test. Right now, only recommendations runs tests, and they use a custom Jupyter-based framework." | Self-service experimentation |
| CFO | "I'm about to approve a $3M annual DS budget. I need to know what we're getting for it." | ROI visibility |
| CEO | "I want us to be a data-driven company. I don't know what that means operationally, but I know we're not there yet." | Vision, strategy |
This assessment revealed several organizational pathologies:
- Prioritization bottleneck. The 3-person centralized team was the sole DS resource for the entire company. Any PM who could not wait 3-4 weeks either went without DS support or hired contractors — creating unmonitored, undocumented models in production (the ad targeting model).
- No shared infrastructure. The recommendation system's feature store, experiment framework, and deployment pipeline were custom-built by the founding team and were not usable by anyone else.
- No monitoring or documentation. The recommendation system worked, but no one outside the DS team could understand, debug, or maintain it.
- No experimentation culture. Only the recommendations team ran A/B tests. Other teams shipped features based on intuition.
Quick Wins (Months 1-3)
The VP prioritized three quick wins — not because they were the most important long-term investments, but because they were the fastest path to demonstrating value and building credibility:
Quick Win 1: Audit the ad targeting model (Week 2). The VP assigned one data scientist to review the contractor-built ad targeting model. The review found: no validation data split (the model was evaluated on training data), no fairness audit, no monitoring, and a training-serving skew that reduced real-world performance by approximately 20% from the reported offline metrics. The VP presented the findings to the Ads PM not as criticism but as a risk assessment: "This model is generating revenue, but it has three vulnerabilities that could cause a production incident or a compliance issue. Here is a 2-week remediation plan." The Ads PM became an ally.
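Training-serving skew of the kind the audit uncovered is often caught by comparing the distribution a feature had at training time against what the model actually sees in production. A minimal sketch using the Population Stability Index (PSI) on synthetic data — the function name, bin count, and thresholds are illustrative assumptions, not StreamRec's actual audit code:

```python
import math
import random

def psi(expected, actual, n_bins=10):
    """Population Stability Index: compares the serving-time distribution
    of a feature (actual) to its training-time distribution (expected).
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    lo, hi = min(expected), max(expected)
    # inner bin edges derived from the training-time range
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def bin_fracs(values):
        counts = [0] * n_bins
        for v in values:
            i = 0
            while i < n_bins - 1 and v >= edges[i]:
                i += 1
            counts[i] += 1
        return [max(c / len(values), 1e-6) for c in counts]  # floor avoids log(0)

    e, a = bin_fracs(expected), bin_fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(5000)]         # training sample
serving_same = [random.gauss(0.0, 1.0) for _ in range(5000)]  # serving matches training
serving_skew = [random.gauss(0.8, 1.2) for _ in range(5000)]  # shifted serving traffic

print(f"PSI, no skew: {psi(train, serving_same):.3f}")
print(f"PSI, skewed:  {psi(train, serving_skew):.3f}")
```

A per-feature PSI report like this, run on a schedule, turns a one-off audit into the kind of ongoing monitoring the contractor's model lacked.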
Quick Win 2: Launch a DS value dashboard (Month 1). Following Section 39.6.2, the VP created a monthly Slack post with four numbers: models in production (2), experiment win rate (3/5 in Q1), cumulative attributed revenue from the recommendation system ($18M annualized, from the A/B test in Chapter 33), and team health (3 of 3 team members responding "agree" or "strongly agree" to "I would recommend this team to a friend"). The CFO cited this dashboard in the next board meeting.
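A four-number post like this is simple enough to script. A sketch of the formatting step — the helper name and signature are hypothetical, and in practice the inputs would be pulled from the model registry, experiment log, finance attribution, and survey tooling rather than passed in by hand:

```python
def ds_value_post(models_in_prod, wins, tests, attributed_rev_m,
                  health_agree, team_size):
    """Format the monthly DS value update as a plain-text Slack post."""
    win_rate = wins / tests
    return "\n".join([
        "*Data Science -- Monthly Value Update*",
        f"- Models in production: {models_in_prod}",
        f"- Experiment win rate: {wins}/{tests} ({win_rate:.0%})",
        f"- Attributed revenue: ${attributed_rev_m}M annualized",
        f"- Team health: {health_agree}/{team_size} would recommend the team",
    ])

# The Month 1 numbers from the case study
print(ds_value_post(2, 3, 5, 18, 3, 3))
```

Keeping the post to four stable numbers, month after month, is what made it citable in a board meeting: the CFO could track the same metrics over time.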
Quick Win 3: Document the recommendation system (Months 2-3). The VP hired a technical writer (contractor) and paired them with each data scientist for one week to produce a system architecture document, an on-call runbook, and a monitoring playbook. The VP of Engineering's on-call team could now debug recommendation system incidents without waking up a data scientist at 2am.
Phase 2: Months 4-9 — Hiring and Structural Transition
The Hiring Plan
The VP designed the 30-person target team using the hub-and-spoke model:
| Team | Type | Headcount | Lead | Reports To (solid) | Reports To (dotted) |
|---|---|---|---|---|---|
| ML Platform | Hub | 5 | ML Platform Lead | VP of DS | — |
| DS Enablement + Responsible AI | Hub | 3 | Responsible AI Lead | VP of DS | — |
| Recommendations | Spoke | 5 | Senior DS (existing) | Recommendations PM | VP of DS |
| Search | Spoke | 4 | Senior DS (new hire) | Search PM | VP of DS |
| Ads | Spoke | 4 | Senior DS (new hire) | Ads PM | VP of DS |
| Content Moderation | Spoke | 3 | Senior DS (new hire) | Trust & Safety Lead | VP of DS |
| Business Analytics | Spoke | 3 | Senior DS (new hire) | VP of Product | VP of DS |
| VP of DS + DS Manager | Hub | 3 | VP of DS | CEO | — |
| **Total** | | **30** | | | |
The hiring sequence was deliberate:
- Months 4-6: Hire the ML Platform Lead and 2 ML platform engineers. Rationale: shared infrastructure must be built before spokes can be effective. Without a feature store, experiment platform, and model registry, each spoke will build its own — recreating the duplication the hub-and-spoke model is designed to prevent.
- Months 5-8: Hire spoke leads for Search, Ads, Content Moderation, and Business Analytics. Rationale: spoke leads define the technical direction for their domain; hiring the team before the lead produces misalignment.
- Months 7-12: Fill the remaining spoke positions. Rationale: by this point, the hub infrastructure is partially operational and the spoke leads can onboard new hires into a functioning team.
- Months 9-12: Hire the Responsible AI Lead and DS Enablement team. Rationale: responsible AI practices must be defined before the team is large enough to generate unreviewed models — but the founding team can handle review for the first 15-20 hires.
The Structural Transition
The hardest moment came in Month 6. The three founding data scientists — who had built everything from scratch, reported to the VP of Engineering, and operated as a tight-knit trio — were told that the structure was changing. Two would remain on the Recommendations spoke (one as the spoke lead). The third would join the ML Platform hub.
The data scientist assigned to the ML Platform hub — the one who had built the custom feature store in Chapter 25 — initially resisted. "I'm a data scientist, not an infrastructure engineer." The VP reframed the role: "You built the feature store that serves 50 million users. You understand what data scientists need better than any platform engineer. Your job is to build the infrastructure that makes the next 20 data scientists as productive as you were." The data scientist accepted. Eighteen months later, they were promoted to Staff ML Engineer — and the feature store they redesigned was serving not just the recommendation system but all six spoke teams.
One founding data scientist — the one who had built the causal inference pipeline from Chapters 15-19 — left in Month 8. They cited the loss of the small-team culture and a desire to join an earlier-stage startup. The VP conducted an honest exit interview, wished them well, and used the feedback to increase investment in DS community-building: a weekly paper reading group, a monthly tech talk series, and a quarterly hackathon.
Phase 3: Months 10-18 — Scaling and Maturity
Infrastructure Milestones
| Month | Milestone | Impact |
|---|---|---|
| 10 | Shared feature store v2 operational | All spoke teams use consistent features; no more ad hoc feature pipelines |
| 11 | Experimentation platform self-service launch | Product managers can configure A/B tests without DS involvement; DS reviews the analysis plan before launch |
| 13 | Model registry + CI/CD pipeline for ML | Any team can deploy a model through a standardized pipeline with testing, canary, and rollback |
| 14 | Fairness auditing framework automated | Every model deployment triggers a fairness audit; results are logged and reviewed by the Responsible AI Lead |
| 16 | Monitoring dashboard v2 (Grafana) | Four-layer dashboard (business, model, data, system) from Chapter 30, covering all production models |
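The self-service experimentation platform (Month 11) still routes analysis plans through DS review, and one concrete check in that review is whether a proposed test is adequately powered. A standard two-proportion sample-size sketch — the function name is hypothetical, and the z-values are fixed at the common two-sided α=0.05 / 80%-power case rather than parameterized:

```python
import math

def sample_size_per_arm(p_baseline, relative_lift):
    """Smallest n per arm to detect a relative lift over a baseline
    conversion rate, via the two-proportion normal approximation.
    z-values fixed at two-sided alpha=0.05 (1.96) and 80% power (0.8416)."""
    z_alpha, z_beta = 1.96, 0.8416
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 2% relative lift on a 10% baseline takes vastly more traffic
# than detecting a 10% lift -- a fact PMs often discover in plan review.
print(sample_size_per_arm(0.10, 0.02))
print(sample_size_per_arm(0.10, 0.10))
```

Surfacing this calculation in the self-service UI is what lets PMs configure tests without DS involvement while keeping underpowered tests from launching.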
Culture Milestones
| Month | Milestone |
|---|---|
| 6 | First pre-registered A/B test outside the recommendations team (Search) |
| 8 | First model deployment blocked by fairness audit (Content Moderation model had disparate impact on non-English content) |
| 12 | First negative A/B test result accepted without pushback by a VP (social recommendations feature) |
| 15 | First hackathon — 7 cross-team projects, 2 of which entered the sprint backlog |
| 18 | Experimentation maturity survey: Level 3 across all product teams, Level 4 in recommendations |
By Month 18: The Organization at Scale
| Metric | Month 0 | Month 18 |
|---|---|---|
| Data scientists | 3 | 28 (2 positions still open) |
| Models in production | 2 | 11 |
| Monthly A/B tests launched | 2 | 14 |
| Experiment win rate | 60% | 42% (this is good — the team is testing riskier hypotheses) |
| Annualized attributed revenue | $18M | $47M |
| DS team annual cost | $600K | $4.2M |
| ROI | 30x | 11.2x (lower but still strong — infrastructure investment is amortizing) |
| MLOps maturity | Level 1 | Level 2 (approaching Level 3 for recommendations) |
| Experimentation maturity | Level 2 (recommendations only) | Level 3 (all teams) |
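The ROI rows above are simple division of attributed revenue by fully loaded team cost; a quick check of the table's arithmetic:

```python
def roi(attributed_revenue, annual_cost):
    """Attributed revenue divided by fully loaded team cost."""
    return attributed_revenue / annual_cost

print(f"Month 0:  {roi(18_000_000, 600_000):.1f}x")    # 30.0x
print(f"Month 18: {roi(47_000_000, 4_200_000):.1f}x")  # 11.2x
```

The drop from 30x to 11.2x is the expected shape of the curve: the denominator now includes platform and enablement headcount whose revenue impact shows up in later periods.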
Lessons Learned
1. Infrastructure before headcount. The single highest-leverage decision was hiring the ML Platform team before filling the spokes. Without shared infrastructure, each spoke would have built its own feature store, experiment framework, and deployment pipeline — consuming 30-40% of each spoke's capacity on commodity work. The 6-month delay in filling spoke positions was painful (PMs continued to wait for DS resources), but the alternative — 5 teams building 5 feature stores — would have been worse.
2. Quick wins build political capital; political capital funds long-term investments. The ad targeting model audit, the value dashboard, and the documentation effort consumed less than one person-month of effort combined. But they established the VP's credibility with the Ads PM, the CFO, and the VP of Engineering — credibility that was essential when the VP later requested $500K for the ML platform team (a line item with no immediate revenue impact).
3. Community-building is infrastructure, not overhead. The weekly paper reading group, monthly tech talks, and quarterly hackathons consumed approximately 5% of the team's time. They were the primary mechanism for preventing the isolation that the embedded model creates. The data scientist who left cited loss of community as the primary factor — a signal that the VP initially underestimated this risk. After investing in community-building, voluntary attrition dropped to zero for the remaining 12 months.
4. The experiment win rate should decrease over time. StreamRec's win rate dropped from 60% to 42% between Month 0 and Month 18. This is a positive signal: the team is testing riskier, more ambitious hypotheses rather than only running tests they expect to win. A sustained win rate above 80% indicates that the team is not testing enough — they are using A/B tests to confirm obvious improvements rather than to learn about uncertain ideas.
5. The first blocked deployment is a culture-defining moment. When the fairness audit blocked the Content Moderation model in Month 8, the Trust & Safety PM escalated to the VP of Product. The VP of DS supported the block, presented the fairness data, and proposed a 2-week remediation. The model was re-deployed with a mitigation (threshold adjustment for non-English content). The message to the organization was clear: fairness gates are real, not decorative. No subsequent deployment was escalated.