Case Study 1: Four Data Scientists, Four Directions


Background

Four data scientists graduated from the same bootcamp two years ago. They all completed the same curriculum: Python, SQL, statistics, machine learning, and a capstone project. They all had the same foundation --- roughly equivalent to what this textbook covers.

Two years later, they work in different roles at different companies. Their day-to-day work looks nothing alike. This case study follows each of them through a typical week, showing how the same foundation leads to very different careers depending on which skills you develop next.


Priya: NLP Engineer at a Legal Tech Company

The Role

Priya builds natural language processing systems that help lawyers review contracts faster. Her company's product takes a 200-page merger agreement and highlights clauses that deviate from standard language: unusual indemnification terms, non-standard termination triggers, atypical representations and warranties.

A Typical Week

Monday. A client reports that the system is flagging too many false positives on indemnification clauses in healthcare contracts. Priya pulls the production logs, examines the misclassified clauses, and discovers the pattern: healthcare M&A contracts use specialized terminology ("HIPAA representations," "PHI indemnity") that the model has not seen enough of during training. The underlying model is a fine-tuned BERT classifier. She adds 300 labeled healthcare clauses to the training set and schedules a retraining run.
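
The augment-and-retrain step can be sketched with a lightweight stand-in for the production model (TF-IDF plus logistic regression rather than fine-tuned BERT; the clause texts and labels below are invented):

```python
# Stand-in sketch of the augment-and-retrain loop (production uses a
# fine-tuned BERT model; all clause texts and labels here are invented)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Existing training examples: 1 = unusual indemnification clause
train_texts = [
    "The Seller shall indemnify the Buyer against all losses arising from the breach.",
    "This Agreement may be terminated by either party upon thirty days notice.",
    "The aggregate liability of the Seller shall not exceed the purchase price.",
    "All notices shall be delivered in writing to the addresses set forth herein.",
]
train_labels = [1, 0, 1, 0]

# Newly labeled healthcare clauses with the terminology the model was missing
new_texts = [
    "The Seller shall indemnify the Buyer for any breach of HIPAA representations.",
    "PHI indemnity obligations of the Seller survive termination of this Agreement.",
]
new_labels = [1, 1]

# Augment the training set and retrain
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts + new_texts, train_labels + new_labels)

print(clf.predict(["Buyer is indemnified for unauthorized PHI disclosures."]))
```

The same pattern scales up: the production run swaps the pipeline for the BERT fine-tuning job, but the data-augmentation step is identical.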

Tuesday--Wednesday. The product team wants to add a new feature: summarizing each flagged clause in plain English. Priya experiments with retrieval-augmented generation (RAG) --- retrieving similar clauses from a database of previously reviewed contracts and using an LLM to generate a summary conditioned on the retrieved examples. She builds a prototype using LangChain and an embedding model for semantic search.

# Priya's RAG prototype (simplified)
from sentence_transformers import SentenceTransformer
import numpy as np

# Encode the clause database
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# In production, these embeddings are pre-computed and stored in a vector DB
clause_texts = [
    "The Seller shall indemnify the Buyer against all losses arising from...",
    "Notwithstanding the foregoing, the aggregate liability shall not exceed...",
    # ... thousands of clauses
]
# Normalize embeddings so that a plain dot product equals cosine similarity
clause_embeddings = encoder.encode(clause_texts, normalize_embeddings=True)

def find_similar_clauses(query_clause: str, top_k: int = 5) -> list:
    """Retrieve the most similar clauses from the database."""
    query_embedding = encoder.encode([query_clause], normalize_embeddings=True)
    similarities = np.dot(clause_embeddings, query_embedding.T).squeeze()
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [clause_texts[i] for i in top_indices]

Thursday. Model retraining completes. Priya evaluates the updated model on a held-out test set of healthcare clauses. Precision on indemnification clauses improves from 0.71 to 0.84. She runs the fairness check (does the model perform differently on contracts from different jurisdictions?) and checks for regression on other clause types. Everything looks clean. She pushes the model to staging.
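
The per-clause-type regression check can be sketched as follows (the labels and predictions below are synthetic stand-ins for the held-out set):

```python
# Sketch of Priya's regression check: per-clause-type precision and recall
# on a held-out set (all labels and predictions below are synthetic)
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0])  # 1 = unusual
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0])
clause_type = np.array(["indemnification"] * 8 + ["termination"] * 8)

# The check: no clause type's metrics may drop relative to the prior model
for ct in ("indemnification", "termination"):
    mask = clause_type == ct
    p = precision_score(y_true[mask], y_pred[mask])
    r = recall_score(y_true[mask], y_pred[mask])
    print(f"{ct}: precision={p:.2f} recall={r:.2f}")
```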

Friday. Sprint retrospective and planning. Priya presents the healthcare improvement to the product team, framing it in client terms: "84% of the indemnification clauses the model flags in healthcare M&A are now genuinely unusual, up from 71%. The share of flags that are false positives dropped from 29% to 16%, meaning lawyers spend 45% less time reviewing false flags."

What She Learned Beyond This Textbook

  • Transformer architecture (self-attention, positional encoding, encoder-decoder)
  • Fine-tuning pre-trained models (BERT, RoBERTa) for domain-specific classification
  • Embedding models and vector databases for semantic search
  • RAG architecture for combining retrieval with generation
  • LLM evaluation (hallucination detection, faithfulness metrics)
  • Tokenization details (WordPiece, BPE) and their impact on domain-specific text

What She Uses Every Day from This Textbook

  • Evaluation metrics and threshold optimization (Chapters 16--18)
  • Feature engineering intuition --- even in NLP, deciding how to represent text is a feature engineering decision (Chapter 6)
  • Monitoring for drift --- model performance degrades as legal language evolves (Chapter 32)
  • Stakeholder communication --- translating precision and recall into "time saved per lawyer" (Chapter 34)

Marcus: Computer Vision Engineer at a Manufacturing Company

The Role

Marcus builds visual inspection systems for a semiconductor manufacturer. His systems analyze high-resolution images of silicon wafers, detecting microscopic defects (scratches, particles, pattern misalignments) that are invisible to the naked eye. A missed defect costs $12,000 in downstream rework. A false alarm costs $200 in unnecessary reinspection.

A Typical Week

Monday. The fab's defect rate spiked 15% over the weekend. Marcus pulls the production images and runs them through the model. The model is catching the defects correctly --- the issue is a real quality problem, not a model problem. He reports this to the process engineering team, including the spatial distribution of defects (clustered in the upper-right quadrant of wafers from Chamber 7). This is feature engineering for manufacturing: the model detects defects, but Marcus's analysis of the spatial pattern identifies the root cause.
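
The spatial tally behind that finding can be sketched in a few lines of pandas (the coordinates and chamber assignments below are invented):

```python
# Sketch of the spatial root-cause tally: defect counts by chamber and
# wafer quadrant (all coordinates and chamber IDs here are invented)
import pandas as pd

defects = pd.DataFrame({
    "x": [0.8, 0.7, 0.9, 0.2, 0.75, 0.85],   # normalized wafer coordinates
    "y": [0.9, 0.8, 0.7, 0.3, 0.95, 0.85],
    "chamber": [7, 7, 7, 3, 7, 7],
})
defects["quadrant"] = [
    ("upper" if y >= 0.5 else "lower") + "-" + ("right" if x >= 0.5 else "left")
    for x, y in zip(defects["x"], defects["y"])
]
counts = defects.groupby(["chamber", "quadrant"]).size()
print(counts)   # the cluster in Chamber 7's upper-right quadrant stands out
```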

Tuesday--Wednesday. Marcus is training a new model for a next-generation chip design. The wafer geometry is different, and the existing model does not transfer well. He starts with a pre-trained EfficientNet backbone, freezes the early layers (which detect generic edges and textures), and fine-tunes the later layers on 2,000 labeled images of the new wafer type.

# Marcus's transfer learning setup (simplified)
import torch
import torch.nn as nn
from torchvision import models, transforms

# Load pre-trained EfficientNet
model = models.efficientnet_b0(weights="IMAGENET1K_V1")

# Freeze early layers (generic feature extraction)
for param in model.features[:6].parameters():
    param.requires_grad = False

# Replace the classifier head for binary classification
model.classifier = nn.Sequential(
    nn.Dropout(p=0.3),
    nn.Linear(model.classifier[1].in_features, 1),
    nn.Sigmoid(),
)

# Training configuration
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.0001,
)
criterion = nn.BCELoss()

# Data augmentation for manufacturing images
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(90),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

Thursday. The new model reaches 0.97 recall and 0.91 precision on the validation set. Marcus spends the day on Grad-CAM visualizations --- heatmaps showing which regions of the image the model is attending to. This is the CV equivalent of SHAP (Chapter 19): it builds trust with the process engineers by showing that the model is looking at the defect, not at an imaging artifact.
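
The mechanics of Grad-CAM can be sketched on a toy CNN (this is not the production EfficientNet; the architecture and input below are invented): pool the gradients of the score with respect to the last convolutional layer's activations, use them to weight those activations, and keep the positive part.

```python
# Minimal Grad-CAM sketch on a toy CNN (not the production model)
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),   # treat this as the last conv layer
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)

# Capture the target layer's activations and gradients with hooks
store = {}
target = model[2]
target.register_forward_hook(lambda m, i, o: store.update(act=o))
target.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

image = torch.randn(1, 1, 32, 32)
model(image).squeeze().backward()            # backprop the defect score

weights = store["grad"].mean(dim=(2, 3), keepdim=True)   # pooled gradients
cam = torch.relu((weights * store["act"]).sum(dim=1)).squeeze(0)
print(cam.shape)   # one importance value per spatial location
```

Upsampled and overlaid on the wafer image, `cam` is the heatmap Marcus shows the process engineers.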

Friday. Marcus presents to the quality team. His slide: "The new model catches 97% of defects on Gen-4 wafers, and 91% of its alarms are real defects. At current production volume (800 wafers/day, about 16 of them defective), that means roughly one missed defect every two days ($12,000 each in downstream rework) and only one or two false alarms per day ($200 each in reinspection). Net value vs. manual inspection: $340,000/month." He learned this framing from Chapter 34.
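
The expected-value framing behind the slide reduces to weighting each error type by its cost, using the two costs given earlier in this profile (the error counts in the final line are hypothetical):

```python
# Expected-value framing (Chapter 34): weight errors by cost, not by count
COST_MISS = 12_000       # missed defect: downstream rework
COST_FALSE_ALARM = 200   # false alarm: unnecessary reinspection

def expected_cost(n_missed: float, n_false_alarms: float) -> float:
    """Total expected cost of a batch of inspection errors."""
    return n_missed * COST_MISS + n_false_alarms * COST_FALSE_ALARM

# One missed defect costs as much as 60 false alarms, so the operating
# threshold should be tuned to favor recall
print(COST_MISS / COST_FALSE_ALARM)      # 60.0
print(expected_cost(1, 10))              # hypothetical error counts
```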

What He Learned Beyond This Textbook

  • CNN architectures (ResNet, EfficientNet, YOLO)
  • Transfer learning and fine-tuning strategies
  • Data augmentation for limited labeled datasets
  • Object detection and segmentation (not just classification)
  • Grad-CAM and other visual explanation methods
  • Edge deployment (running models on factory floor hardware with ONNX Runtime)

What He Uses Every Day from This Textbook

  • The expected value framework for asymmetric costs --- missed defects cost 60x more than false alarms (Chapter 34)
  • Class imbalance handling --- defects are 2% of all wafers (Chapter 17)
  • Model monitoring --- detecting when a new process recipe degrades model performance (Chapter 32)
  • Reproducible pipelines --- every model must be retrainable from raw data (Chapter 10)

Elena: Experimentation Lead at a Fintech Company

The Role

Elena designs and analyzes experiments for a consumer banking app with 4 million active users. Her team runs 15--20 A/B tests per month across product features, marketing messages, onboarding flows, and pricing. Her job is to ensure the company makes data-driven decisions --- and to push back when the data says "no."

A Typical Week

Monday. The product team wants to test a new savings goal feature. Elena designs the experiment: primary metric (savings deposit rate), guardrail metrics (app engagement, customer support tickets), required sample size (80,000 users for 2% minimum detectable effect at 80% power), and expected duration (3 weeks). She documents the pre-analysis plan to prevent HARKing (Hypothesizing After Results are Known).

Tuesday. Results are in for last week's pricing test. The treatment group (simplified pricing page) showed a 1.2% increase in conversion, but the 95% confidence interval is [-0.3%, 2.7%]. The VP of Product asks: "Is it significant?" Elena's answer: "The point estimate is positive, but the confidence interval includes zero. We cannot distinguish this from noise. We have three options: extend the test for two more weeks to increase power, launch to 20% of users with close monitoring, or declare the test inconclusive and move on."
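
The interval the VP is asking about comes from a confidence interval for a difference in conversion rates. A Wald-style sketch (the user counts below are invented, chosen so the arithmetic lands on the quoted interval):

```python
# Wald 95% CI for a difference in conversion rates (counts are invented,
# chosen to land on the interval quoted in the text)
import math

n_c, conv_c = 3_200, 320    # control: 10.0% conversion
n_t, conv_t = 3_200, 358    # treatment: ~11.2% conversion

p_c, p_t = conv_c / n_c, conv_t / n_t
diff = p_t - p_c
se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"lift = {diff:+.1%}, 95% CI [{lo:+.1%}, {hi:+.1%}]")
```

Because `lo` is below zero, the data cannot rule out "no effect", which is the substance of Elena's answer.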

Wednesday. Elena runs a causal analysis on a natural experiment. The company expanded its credit limit increase program to a new state in March. Using a difference-in-differences design, she compares spending behavior in the new state (treatment) vs. similar states where the program was not yet available (control).

# Elena's difference-in-differences analysis (simplified, simulated data)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)

# Simulate state-level monthly spending data
n_months_pre = 6
n_months_post = 3
states_treatment = ["NC", "SC", "GA"]
states_control = ["TN", "VA", "KY", "AL", "MS"]

records = []
for state in states_treatment + states_control:
    is_treatment = state in states_treatment
    # State-level baseline, drawn once per state
    base = 2400 + np.random.normal(0, 50)
    for month in range(-n_months_pre, n_months_post + 1):
        if month == 0:
            continue
        post = int(month > 0)
        # Common time trend shared by all states
        trend = 15 * month
        # Treatment effect: only treatment states, only after rollout
        effect = 180 * is_treatment * post
        spending = base + trend + effect + np.random.normal(0, 80)
        records.append({
            "state": state,
            "month": month,
            "post": post,
            "treatment": int(is_treatment),
            "avg_spending": spending,
        })

df = pd.DataFrame(records)

# DiD regression
model = smf.ols("avg_spending ~ treatment * post", data=df).fit()
print(model.summary().tables[1])
print(f"\nDiD estimate: ${model.params['treatment:post']:.0f}")
print(f"95% CI: [${model.conf_int().loc['treatment:post', 0]:.0f}, "
      f"${model.conf_int().loc['treatment:post', 1]:.0f}]")

Thursday. Elena presents the quarterly experimentation report to the leadership team. Key message: "We ran 47 experiments in Q1. 12 showed statistically significant positive effects and were launched (estimated $2.1M annual revenue impact). 8 showed significant negative effects and were killed (prevented an estimated $800K in losses). 27 were inconclusive --- and that is fine. An inconclusive test is not a failure; it is information that the effect, if it exists, is too small to detect with our current sample size."

Friday. Elena reviews a junior analyst's experiment design. The analyst wants to test four different onboarding flows simultaneously. Elena explains the multiple comparisons problem, recommends a single primary metric with pre-specified Bonferroni correction, and adds a pre-registration document. This is Chapter 3 in practice, scaled to organizational complexity.
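
The correction Elena recommends can be sketched directly (the p-values below are invented):

```python
# Bonferroni correction for four simultaneous comparisons (p-values invented)
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.034, 0.21, 0.048]   # one onboarding flow vs. control each
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={sig}")
```

Without the correction, three of the four flows would look significant at 0.05; with it, only one survives.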

What She Learned Beyond This Textbook

  • Bayesian A/B testing and sequential testing (test continuously without inflating error rates)
  • Causal inference methods: DiD, regression discontinuity, instrumental variables, synthetic control
  • Heterogeneous treatment effects: who benefits most from a treatment?
  • Multi-armed bandits for dynamic allocation (explore vs. exploit)
  • Experimentation platform design (Eppo, Statsig, or in-house)
  • Organizational experimentation culture: how to get teams to test before launching

What She Uses Every Day from This Textbook

  • A/B testing fundamentals (Chapter 3)
  • Statistical evaluation and honest metrics (Chapters 4, 16)
  • The business communication framework --- translating p-values into dollars (Chapter 34)
  • Fairness auditing --- ensuring experiments do not disproportionately affect demographic groups (Chapter 33)

David: ML Engineer at a Ride-Sharing Company

The Role

David keeps 40+ ML models running in production. He does not build models --- the data science team does that. David builds the infrastructure that trains, deploys, monitors, and retrains models at scale. When a model breaks at 2 AM, David's pager goes off.

A Typical Week

Monday. The pricing model is serving stale predictions. David traces the issue: the nightly retraining pipeline failed because the upstream data source (ride completion events) had a schema change. The events table added a new column, which broke the feature engineering SQL query. David fixes the query, adds a schema validation check to the pipeline, and retrains the model. He adds the schema check to the CI/CD pipeline so this class of error is caught automatically in the future.
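
The schema check can be sketched as a fail-fast guard at the top of the pipeline (the expected column names below are invented):

```python
# Fail-fast schema guard for the events table (column names are invented)
import pandas as pd

EXPECTED_COLUMNS = {"ride_id", "rider_id", "completed_at", "fare_usd"}

def validate_schema(df: pd.DataFrame) -> None:
    """Raise before feature engineering runs against a drifted schema."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    extra = set(df.columns) - EXPECTED_COLUMNS
    if missing:
        raise ValueError(f"events table missing columns: {sorted(missing)}")
    if extra:
        # A new column is not fatal, but it should be reviewed deliberately
        print(f"warning: unexpected new columns: {sorted(extra)}")

events = pd.DataFrame(columns=["ride_id", "rider_id", "completed_at",
                               "fare_usd", "tip_usd"])
validate_schema(events)   # warns about the new tip_usd column
```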

Tuesday--Wednesday. David is building a new feature store implementation. The data science team computes features in pandas during training, and the serving system recomputes them in SQL at prediction time. The two implementations have drifted apart: the training code uses fillna(median) but the serving SQL uses COALESCE(value, 0). This training-serving skew caused a 3% accuracy degradation that took two weeks to diagnose. David's solution: define features once in a feature store (Feast), materialize them to both the offline store (for training) and the online store (for serving).

# David's feature store definition (Feast)
# This is a declarative configuration, not runtime code. (Older Feast API
# shown; recent releases use Field and a schema= argument instead.)
from feast import Entity, Feature, FeatureView, FileSource, ValueType
from datetime import timedelta

# Define the entity (what entity do features describe?)
rider = Entity(
    name="rider_id",
    value_type=ValueType.INT64,
    description="Unique rider identifier",
)

# Define the data source
rider_features_source = FileSource(
    path="s3://features/rider_features.parquet",
    event_timestamp_column="event_timestamp",
)

# Define the feature view
rider_profile_features = FeatureView(
    name="rider_profile",
    entities=["rider_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="lifetime_rides", dtype=ValueType.INT64),
        Feature(name="avg_ride_distance_km", dtype=ValueType.FLOAT),
        Feature(name="days_since_last_ride", dtype=ValueType.INT64),
        Feature(name="preferred_payment_method", dtype=ValueType.STRING),
        Feature(name="surge_acceptance_rate", dtype=ValueType.FLOAT),
    ],
    online=True,
    source=rider_features_source,
)
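
The skew that took two weeks to diagnose can be reproduced in miniature: the same raw values, imputed the pandas way versus the SQL way, yield different features for the same rider (the distances below are invented):

```python
# Training-serving skew in miniature: two imputation strategies disagree
# on the same missing value (the distances here are invented)
import pandas as pd

raw = pd.Series([4.0, 6.0, None, 10.0], name="avg_ride_distance_km")

training_feature = raw.fillna(raw.median())  # pandas path: fillna(median)
serving_feature = raw.fillna(0)              # SQL path: COALESCE(value, 0)

# The rider with the missing value gets 6.0 at training time but 0.0 at
# serving time: the model sees inputs it was never trained on
print(training_feature[2], serving_feature[2])
```

Materializing a single feature definition to both the offline and online store removes this class of bug by construction.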

Thursday. David runs the weekly model health report. Three of 40 models show drift (PSI > 0.15). Two are expected (seasonal patterns in ride demand). One is unexpected: the driver ETA model has degraded because a city closed a major bridge for construction. The model's features do not include road closure data. David flags it for the data science team and switches the model to a fallback (historical average ETA by route) until a retrained version is available.
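
The drift score itself can be sketched as a small function. The 0.15 alert threshold is the one quoted above; the ETA distributions below are simulated stand-ins:

```python
# Population Stability Index (PSI) between a training baseline and current
# production data (the ETA distributions here are simulated)
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI over decile bins of the baseline distribution."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch values outside baseline
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)         # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(7)
baseline = rng.normal(12.0, 3.0, 10_000)   # ETA minutes at training time
current = rng.normal(15.0, 3.0, 10_000)    # after the bridge closure
print(f"PSI = {psi(baseline, current):.2f}")   # well above the 0.15 threshold
```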

Friday. Sprint demo. David shows the new automated canary deployment system: when a new model version is pushed, it automatically routes 5% of traffic to the new model for 24 hours. If the new model's metrics (latency, error rate, prediction distribution) are within acceptable bounds, traffic automatically scales to 100%. If not, it automatically rolls back. Zero human intervention needed.
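
Stripped of the traffic-routing machinery, the promotion decision is a bounds check over the canary's metrics (the metric names and thresholds below are invented):

```python
# Core decision rule of the canary system (metric names and bounds invented)
def canary_healthy(metrics: dict, bounds: dict) -> bool:
    """True only if every monitored metric is within its (lo, hi) bound."""
    return all(lo <= metrics[name] <= hi for name, (lo, hi) in bounds.items())

bounds = {
    "p99_latency_ms": (0, 120),
    "error_rate": (0.0, 0.001),
    "mean_prediction": (4.0, 9.0),   # guards against degenerate outputs
}

candidate = {"p99_latency_ms": 95, "error_rate": 0.0004, "mean_prediction": 6.2}
action = "promote to 100%" if canary_healthy(candidate, bounds) else "roll back"
print(action)
```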

What He Learned Beyond This Textbook

  • Container orchestration (Docker, Kubernetes) at scale
  • Feature stores (Feast, Tecton)
  • Data pipeline orchestration (Airflow, Prefect)
  • Model serving frameworks (Triton Inference Server, TensorFlow Serving, BentoML)
  • Data versioning (DVC)
  • Infrastructure as code (Terraform)
  • Observability and alerting (Prometheus, Grafana, PagerDuty)

What He Uses Every Day from This Textbook

  • Reproducible pipelines and software engineering (Chapters 10, 29)
  • Experiment tracking (Chapter 30)
  • Model deployment and API design (Chapter 31)
  • Monitoring and drift detection (Chapter 32)

Discussion Questions

  1. Skill overlap. All four data scientists use evaluation metrics, monitoring, and stakeholder communication daily. Why do these skills transfer across specializations while algorithm-specific knowledge does not?

  2. The generalist question. For the first two years after completing this textbook, is it better to specialize immediately or to stay generalist? What are the tradeoffs?

  3. Infrastructure vs. modeling. David does not build models. Is he a data scientist? How would you describe the relationship between the model-building role and the infrastructure role?

  4. The causal inference gap. Elena's work addresses a question that Priya's, Marcus's, and David's work generally does not: "Did it work?" Why is causal reasoning underrepresented in most data science curricula?

  5. Communication as career currency. Each profile ends with a stakeholder presentation. What pattern do you notice in how they frame technical results?

  6. Your path. Which of the four roles most appeals to you? What specific skill would you need to develop first to move in that direction?


This case study supports Chapter 36: The Road to Advanced. Return to the chapter for full context on each specialization path.