Case Study 2: Collaborating on a Data Project — A Team Workflow with Git


Tier 2 — Attributed Narrative: This case study is a fictionalized account of a data science team collaboration, constructed to illustrate common workflow challenges and solutions. The characters, company, and specific project are fictional, but the technical scenarios — merge conflicts, environment mismatches, code review catches, and documentation failures — are based on common real-world experiences widely reported in data science practice.


Meet the Team

Four data scientists at a mid-sized health insurance company — Meridian Health — have been tasked with building a model to predict which members are at risk of not completing their annual wellness check. The model will be used to send targeted reminder outreach, and the team has six weeks to deliver a working prototype.

The team:

  • Priya (lead): 4 years of experience, strong in modeling, moderate git skills
  • Marcus: 2 years of experience, strong in data engineering and SQL, advanced git user
  • Lin: 1 year of experience, strong in visualization, minimal git experience
  • Devon: Summer intern, taking an introductory data science course (this course, in fact), no git experience

This case study follows the team through six weeks of collaboration, highlighting the challenges they face and the practices that help them succeed.

Week 1: Setting Up the Project

The First Decision: Repository Structure

Marcus, the most experienced git user, created the repository on GitHub and established the project structure:

wellness-prediction/
├── .gitignore
├── README.md
├── CONTRIBUTING.md
├── requirements.txt
├── data/
│   ├── raw/          # Original data exports (not in git)
│   └── processed/    # Cleaned data (not in git)
├── notebooks/
│   ├── 01-exploration/
│   ├── 02-feature-engineering/
│   └── 03-modeling/
├── src/
│   ├── __init__.py
│   ├── data_loader.py
│   └── features.py
├── tests/
│   └── test_features.py
└── results/
    ├── figures/
    └── reports/

He also wrote a CONTRIBUTING.md with the team's conventions:

# Contributing to Wellness Prediction

## Branching
- Create a branch for each task: `feature/description` or `fix/description`
- Branch from `main`, not from other feature branches
- Delete branches after merging

## Commits
- Use descriptive messages: "Add feature X because Y"
- Commit frequently — at least once per work session
- Never commit data files or credentials

## Pull Requests
- Every PR needs at least one reviewer
- Include a description of what changed and why
- Run the full pipeline before requesting review

## Code Style
- Follow PEP 8
- Use type hints for function signatures
- Set `random_state=42` everywhere
- Use relative paths only

Devon's First Commit

Devon had never used git before. Marcus spent 30 minutes walking through the basics:

# Clone the repository
git clone https://github.com/meridian-health/wellness-prediction.git
cd wellness-prediction

# Create a virtual environment
conda create --name wellness python=3.11 -y
conda activate wellness
pip install -r requirements.txt

# Create a branch for the first task
git checkout -b feature/initial-exploration

# ... do some work ...

# Stage and commit
git add notebooks/01-exploration/initial_look.ipynb
git commit -m "Add initial data exploration notebook with summary statistics"

# Push the branch
git push -u origin feature/initial-exploration

Devon made a classic beginner mistake on the first attempt: the commit message was "add notebook." Marcus gently suggested a revision: "Your future self will see this commit in the log six months from now. 'Add notebook' tells them nothing. What does the notebook do?"

Devon amended the message with `git commit --amend` to "Add initial data exploration notebook with summary statistics and missingness analysis." Much better.

The Environment Problem

Lin cloned the repository and tried to run Devon's notebook. It failed immediately:

ModuleNotFoundError: No module named 'openpyxl'

Devon had installed openpyxl (needed to read Excel files) in their environment but had not updated requirements.txt. The notebook worked on Devon's machine but broke on everyone else's.

This was a teachable moment. Priya explained: "Every time you install a new package, update the requirements file. If it's not in requirements.txt, it doesn't exist for the rest of the team."

Devon added openpyxl to requirements.txt and committed:

# Regenerate requirements.txt, pinning every package in the environment
pip freeze > requirements.txt
git add requirements.txt
git commit -m "Add openpyxl to requirements for Excel file reading"
git push

From that point on, the team adopted a rule: every PR that adds a new import must also update requirements.txt. This rule was added to CONTRIBUTING.md.

Week 2: Parallel Work and the First Merge Conflict

By week 2, the team was working in parallel:

  • Priya was building the feature engineering pipeline on feature/feature-engineering
  • Marcus was writing data loading and validation code on feature/data-pipeline
  • Lin was creating exploratory visualizations on feature/eda-visualizations
  • Devon was researching wellness check completion rates in the literature on feature/literature-review

The Conflict

Both Priya and Marcus modified src/data_loader.py. Priya added a function to load member demographics. Marcus added a function to load claims data. Both changed the same file — but different parts of it.

When Priya pushed her branch and opened a PR, it merged cleanly. When Marcus then opened his PR, GitHub reported a merge conflict:

<<<<<<< main
def load_demographics(filepath: str) -> pd.DataFrame:
    """Load and validate member demographics data."""
    df = pd.read_csv(filepath)
    df['member_id'] = df['member_id'].astype(str)
    return df
=======
def load_claims(filepath: str) -> pd.DataFrame:
    """Load and validate claims data."""
    df = pd.read_csv(filepath)
    df['claim_date'] = pd.to_datetime(df['claim_date'])
    return df
>>>>>>> feature/data-pipeline

Wait — this should not be a conflict. The two functions are entirely different. What happened?

The issue was that both Priya and Marcus had added their function at the same location in the file — at the end, after the same last line. Git could not determine the correct order. Should load_demographics come first, or load_claims?

Marcus resolved the conflict by keeping both functions in a logical order, removing the conflict markers, and committing:

git add src/data_loader.py
git commit -m "Resolve merge conflict: include both load_demographics and load_claims functions"
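After resolution, the end of src/data_loader.py contained both functions with the conflict markers removed. A sketch of the resolved file, reusing the definitions from the conflict snippet above:

```python
import pandas as pd


def load_demographics(filepath: str) -> pd.DataFrame:
    """Load and validate member demographics data."""
    df = pd.read_csv(filepath)
    # Normalize member IDs to strings for consistent joins downstream
    df['member_id'] = df['member_id'].astype(str)
    return df


def load_claims(filepath: str) -> pd.DataFrame:
    """Load and validate claims data."""
    df = pd.read_csv(filepath)
    # Parse dates at load time so downstream code can assume datetimes
    df['claim_date'] = pd.to_datetime(df['claim_date'])
    return df
```

Which function comes first is purely a readability choice; git has no opinion once the markers are gone.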

The Lesson

Merge conflicts are not errors — they are signals that two people touched the same part of the same file. In this case, the resolution was trivial (keep both functions). In more complex cases (two people changing the same function differently), the resolution requires understanding both changes and making a judgment call.

The team adopted two practices to reduce future conflicts:

  1. Communicate about shared files. If you are modifying a file someone else is working on, give them a heads-up.
  2. Merge frequently. Do not let branches diverge for weeks. Merge main into your branch daily to catch conflicts early.

Week 3: Code Review Catches a Bug

Lin submitted a pull request with three new visualization notebooks. Devon was assigned as the reviewer — Priya's philosophy was that even the most junior team member should review code, because reviewing teaches you more than writing.

Devon found something odd in Lin's code:

# Calculate completion rate by age group
completion_rate = df.groupby('age_group')['completed'].mean()

Devon commented on the PR: "This calculates the mean of the 'completed' column, but 'completed' is coded as 1 = Yes, 2 = No (not 0/1). The mean doesn't represent a rate. Should we recode to 0/1 first?"

Devon was right. The data dictionary (which Lin had not checked) coded completion as 1/2 instead of the expected 0/1. The mean of a 1/2 column produces a number between 1 and 2, not between 0 and 1. Every chart in the notebook showed incorrect rates.
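Devon's objection can be verified in a few lines, using made-up values rather than the real member data:

```python
import pandas as pd

# Five members, coded per the data dictionary: 1 = Yes, 2 = No
df = pd.DataFrame({'completed': [1, 1, 2, 2, 2]})    # true completion rate: 2/5 = 0.40

naive = df['completed'].mean()                       # 1.6: between 1 and 2, not a rate
recoded = df['completed'].map({1: 1, 2: 0}).mean()   # 0.4: the actual completion rate
```

Note that the naive value is not just wrong in scale: because 2 codes "No", a group with a higher completion rate gets a lower naive mean, so group rankings invert as well.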

Lin fixed the issue:

# Recode completed: 1=Yes→1, 2=No→0
df['completed'] = df['completed'].map({1: 1, 2: 0})
completion_rate = df.groupby('age_group')['completed'].mean()

And added a data validation check at the top of the pipeline:

assert set(df['completed'].unique()) == {0, 1}, \
    f"Expected completed to be 0/1, got {df['completed'].unique()}"

This bug would have been invisible without code review. The charts still looked superficially plausible: the plotted values fell between 1.4 and 1.8 instead of the true rates of 20% to 60%, and without domain knowledge a viewer might not have questioned them. The code review caught it before the flawed analysis reached stakeholders.

The Lesson

Code review is not about catching syntax errors — the computer does that. It is about catching logic errors, data misunderstandings, and methodological problems that only a human reader would notice. Devon, the least experienced team member, caught the bug by taking the time to check the data dictionary, a step Lin had skipped on the assumption that the coding was the standard 0/1.

Week 4: The "Works on My Machine" Problem

Priya built the predictive model on her laptop and achieved an AUC of 0.83. She committed the notebook, pushed it, and went to a conference.

While she was away, Marcus tried to run the notebook to extend it. It failed:

OSError: [Errno 2] No such file or directory: 'C:/Users/priya/data/processed/features.parquet'

Priya had used an absolute file path. The data existed on her machine but not on Marcus's.

Marcus changed the path to a relative one (data/processed/features.parquet), but then hit another problem: the features file did not exist on his machine, because Priya's feature engineering notebook — which created the file — had never been run there. The file was correctly listed in .gitignore (data files should not be in git), but there was no documentation about how to generate it.

Marcus had to read through three notebooks to figure out the correct execution order. He eventually got it working, but it took two hours — time that could have been saved with a README section titled "How to Run the Analysis":

## How to Run

Run notebooks in this order:
1. `notebooks/01-exploration/data_overview.ipynb` — loads raw data
2. `notebooks/02-feature-engineering/build_features.ipynb` — creates features.parquet
3. `notebooks/03-modeling/train_model.ipynb` — trains and evaluates model

Each notebook depends on the output of the previous one.

Marcus added this to the README and committed it. When Priya returned from the conference, she was grateful — and slightly embarrassed that she had not written it herself.
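A cheap complement to the README is a fail-fast check at the top of each notebook that names the upstream step. A sketch, using a hypothetical require_file helper that is not part of the team's actual code:

```python
from pathlib import Path


def require_file(path: str, hint: str) -> Path:
    """Fail fast with an actionable message when an upstream output is missing."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"{p} not found. {hint}")
    return p

# Example use at the top of the modeling notebook:
# require_file("data/processed/features.parquet",
#              "Run notebooks/02-feature-engineering/build_features.ipynb first.")
```

Instead of a cryptic OSError two hours into debugging, the very first cell tells the reader exactly which notebook to run.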

The Lesson

Two practices would have prevented this entirely:

  1. Use relative paths. Always. Absolute paths break on every machine except yours.
  2. Document the execution order. If notebooks must be run in sequence, say so explicitly in the README.
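One way to make "relative paths, always" robust even when code is launched from different directories is to locate the repository root programmatically. A sketch, using a hypothetical project_path helper that walks upward until it finds the .git directory:

```python
from pathlib import Path


def project_path(*parts: str, root_marker: str = ".git") -> Path:
    """Resolve a path relative to the repository root, found by walking
    up from the current working directory until root_marker appears."""
    here = Path.cwd().resolve()
    for candidate in [here, *here.parents]:
        if (candidate / root_marker).exists():
            return candidate.joinpath(*parts)
    raise FileNotFoundError(f"No {root_marker} found above {here}")

# The same call works from any notebook directory in the repo:
# features = project_path("data", "processed", "features.parquet")
```

The resulting path is correct on every machine, because it depends only on where the repository was cloned, not on any one person's home directory.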

Week 5: The Random Seed Revelation

The model was nearly ready. Priya reported an AUC of 0.83. Marcus ran the same notebook and got 0.81. Devon ran it and got 0.84.

Everyone was running the same code on the same data. The difference: Priya had set random_state=42 in train_test_split but not in RandomForestClassifier. The model's performance varied depending on the random initialization of the decision trees.

The fix was simple:

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    random_state=42  # Added this line
)

After this fix, everyone got the same AUC: 0.826. But the episode revealed how subtle the random seed problem can be — you can seed some operations and miss others, leading to partial reproducibility that is harder to diagnose than complete irreproducibility.

The team adopted a new practice: at the top of every notebook, set all seeds:

import numpy as np
import random

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

And in every scikit-learn function, explicitly pass random_state=RANDOM_SEED.
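The payoff of the seed block can be demonstrated in a few lines: with the seed set, a "random" operation repeats exactly across independent runs (a NumPy sketch):

```python
import numpy as np

RANDOM_SEED = 42


def seeded_shuffle(n: int) -> np.ndarray:
    """Return indices 0..n-1 in a shuffled but reproducible order."""
    np.random.seed(RANDOM_SEED)
    idx = np.arange(n)
    np.random.shuffle(idx)
    return idx

# Two independent runs agree element for element
assert (seeded_shuffle(10) == seeded_shuffle(10)).all()
```

The same principle is why Marcus, Priya, and Devon all converged on 0.826 once every random_state was pinned.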

Week 6: Delivering the Project

In the final week, the team prepared to hand off the project to the engineering team for deployment. This is where all the practices from the previous weeks paid off — and where the gaps became apparent.

What Worked

  • The git history provided a complete record of every change, every decision, and every bug fix. When an engineer asked "Why did you exclude members under 18?", the team could point to a specific commit with a message explaining the rationale.

  • The requirements.txt allowed the engineering team to set up the environment in minutes, not hours.

  • The code review history (preserved in GitHub PR comments) documented the team's thinking, including the data coding bug that Devon caught.

  • The README gave the engineering team a clear starting point for understanding the project.

What Could Have Been Better

  • No automated tests. The team had one test file (test_features.py) but it contained only two tests. The engineering team wanted comprehensive tests before deploying the model.

  • Inconsistent notebook quality. Some notebooks had rich Markdown narration (Lin's visualizations). Others were bare code with cryptic variable names (early notebooks from week 1). The team wished they had enforced notebook narrative standards from the start.

  • Data documentation was thin. The README explained how to run the code but not how the data was structured, what each column meant, or what the business rules were (e.g., "a member is considered 'at risk' if they have not completed a wellness check by October 1").

  • No versioned data. The data was stored on a shared drive with no versioning. When the marketing team updated member records mid-project, the data changed without notice, and the team could not go back to the version they had originally analyzed.

Reflections from the Team

After the project, Priya asked each team member: "What was the most valuable practice we used?"

Marcus: "Pull requests and code review, without question. Devon caught the data coding bug that would have invalidated the entire analysis. One pair of fresh eyes saved us weeks of rework."

Lin: "The requirements.txt. I know it sounds basic, but before this project, I had never pinned my dependencies. I cannot count the number of times I've been unable to reproduce my own results because a library updated."

Devon: "Honestly, just learning git. Before this project, I had never used version control. I was the 'analysis_final_FINAL_v3' person. Now I cannot imagine working without it. The ability to go back to any previous state of the project is like a superpower."

Priya: "The CONTRIBUTING.md. Having our conventions written down meant we did not waste time debating style choices. When someone asked 'how should I name this branch?', the answer was in the document. When someone forgot to set a random seed, the document was the reference, not a person's opinion."

Discussion Questions

  1. Devon, the least experienced team member, caught the most significant bug during code review. What does this tell you about the value of diverse perspectives in code review? Should junior team members always review senior team members' code?

  2. The team's biggest gap was the lack of automated tests. Why are tests important for data science projects, and how do they differ from tests in traditional software engineering?

  3. Priya used an absolute file path that broke on Marcus's machine. This is one of the most common reproducibility failures. Why do you think people keep making this mistake, even experienced data scientists?

  4. The team wished they had versioned their data. For large datasets that change frequently, what strategies could they use? (Hint: think about DVC, database snapshots, or data lakes.)

  5. Think about a group project you have worked on (in any context). What collaboration problems did you experience? Which of the practices from this case study would have helped?

  6. Marcus spent 30 minutes teaching Devon the git basics. Some teams argue that training time is "wasted" when there is a deadline. How would you make the case that investing in tool training saves time in the long run?