Learning Objectives
- Explain why reproducibility matters for scientific credibility and practical collaboration
- Initialize a git repository and perform the core workflow — add, commit, diff, log, and status
- Write clear, informative commit messages that explain why changes were made
- Create and merge branches for parallel work, and understand the purpose of pull requests
- Create and manage virtual environments using conda or venv to isolate project dependencies
- Write a requirements.txt or environment.yml file that allows others to recreate your environment
- Write a README that explains what a project does, how to set it up, and how to use it
- Set random seeds for reproducibility in analyses that involve randomness
In This Chapter
- Chapter Overview
- 33.1 The Reproducibility Crisis
- 33.2 Version Control: What It Is and Why You Need It
- 33.3 Getting Started with Git: The Core Workflow
- 33.4 Writing Good Commit Messages
- 33.5 Branching: Working in Parallel
- 33.6 Remote Repositories and Collaboration
- 33.7 Virtual Environments: Capturing Your Software Stack
- 33.8 The .gitignore File: What NOT to Track
- 33.9 Setting Random Seeds: The Easiest Reproducibility Win
- 33.10 Writing a README: Your Project's Front Door
- 33.11 Project Milestone: Setting Up Your Project Repository
- 33.12 Collaboration in Practice: Working with a Team
- 33.13 Putting It All Together: The Reproducibility Checklist
- Chapter Summary
Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
"Your most important collaborator is your future self — and they won't remember why you did what you did." — Every data scientist, approximately six months into their career
Chapter Overview
Here is a scenario that has played out thousands of times in data science teams around the world:
A data scientist — let's call him Alex — finishes an analysis, saves the Jupyter notebook on his laptop, and presents the results to the team. Everyone is impressed. The analysis shows a promising pattern in customer behavior. The VP of Product asks Alex to extend the analysis next quarter.
Three months later, Alex opens the notebook. He cannot get it to run. The pandas version has been updated, and a function he used has been deprecated. He cannot remember which version of the dataset he used — was it the one before or after the marketing team corrected the regional labels? He tries installing the old library versions, but he does not remember which versions they were. The notebook produces different numbers than his original presentation, and he does not know whether the discrepancy is due to the code, the data, or the libraries.
Alex spends two days trying to reproduce his own work. He eventually gives up and starts over.
This story is not hypothetical. It happens constantly. And it is entirely preventable — with tools and practices you will learn in this chapter.
Reproducibility means that someone else (or your future self) can take your code, your data, and your instructions, and produce the same results. Collaboration means that multiple people can work on the same project without overwriting each other's work, losing track of changes, or descending into chaos.
Both require tools. The most important tool is git — a version control system that tracks every change to every file in a project, lets you go back to any previous version, and enables multiple people to work simultaneously without conflicts. Alongside git, you will learn about virtual environments (which capture the exact software versions your code needs), documentation (which explains your project to others), and workflow practices (which make teamwork functional).
These are not glamorous topics. There are no cool visualizations in this chapter. There are no statistical breakthroughs. But the skills you learn here are the ones that separate a student project that dies on a laptop from a professional project that lives, grows, and serves its purpose.
In this chapter, you will learn to:
- Explain why reproducibility matters for science and for practice (all paths)
- Initialize a git repository and perform the core workflow (all paths)
- Write clear commit messages (all paths)
- Create and merge branches for parallel work (all paths)
- Create virtual environments and manage dependencies (all paths)
- Write a requirements.txt or environment.yml file (all paths)
- Write a README that helps others understand your project (all paths)
- Set random seeds for reproducible analyses (all paths)
Threshold Concept Alert: Reproducibility and version control feel like overhead — extra work that slows you down. The threshold moment comes when you realize they actually speed you up, because you stop losing work, stop breaking code, and stop spending days trying to figure out what you did three months ago.
33.1 The Reproducibility Crisis
In 2012, a biotechnology company called Amgen attempted to reproduce 53 "landmark" preclinical cancer studies — studies published in top journals that had influenced drug development strategies. They could reproduce the findings of only 6. That is an 11% success rate for studies that had passed peer review and shaped medical research.
This was not an isolated finding. In psychology, the Reproducibility Project attempted to replicate 100 published studies and found that only 36% produced the same results. In economics, a 2016 study found that only 61% of results from top journals could be replicated.
This is the reproducibility crisis: a widespread failure of published research to produce the same results when the study is repeated.
Why Can't We Reproduce Results?
The reasons are varied, but several are directly relevant to data science:
Software environment differences. A script that produces one result with pandas 1.3 may produce a different result with pandas 2.0. Function behavior changes, default parameters change, and even numerical precision can differ between library versions. If you do not record which versions you used, reproduction is a matter of luck.
Data versioning. Datasets change. Corrections are applied, rows are added, formats are updated. If you do not record which version of the data you used, your analysis may not be repeatable even with the same code.
Randomness. Many data science techniques involve randomness: train/test splits, random forest models, bootstrap samples, k-fold cross-validation. Without setting a random seed, these operations produce different results every time.
Undocumented decisions. During an analysis, you make dozens of decisions: which rows to exclude, how to handle missing values, which features to include, which hyperparameters to set. If these decisions are not documented, the analysis cannot be reproduced because a future analyst would make different decisions.
"Works on my machine." Code that runs on your laptop with your specific configuration may not run on anyone else's computer. File paths, operating systems, installed libraries, and environment variables all differ.
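Some of these failure modes have cheap defenses. For data versioning, one lightweight option is to record a checksum of the dataset alongside your analysis, so you can later confirm you are looking at the exact same file. Below is a sketch using only Python's standard library; the file name and contents are illustrative stand-ins, not part of any real project:

```python
import hashlib
from pathlib import Path

def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Stand-in for a real dataset; record the digest when you run the analysis.
path = Path("data.csv")
path.write_text("county,rate\nA,0.91\n")
print(file_checksum(path))
```

If a recorded digest no longer matches the file, the data changed and the analysis needs to be rechecked before its results are trusted.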
Why Reproducibility Matters
Reproducibility is not just a scientific ideal — it has practical consequences:
- Trust. If your results cannot be reproduced, why should anyone believe them? Reproducibility is the foundation of credibility.
- Collaboration. If a teammate cannot run your code, they cannot build on your work.
- Debugging. If you cannot reproduce a result, you cannot diagnose why it changed.
- Compliance. In regulated industries (healthcare, finance), reproducibility is a legal requirement. You must be able to show how you arrived at a decision.
- Ethics. As we discussed in Chapter 32, transparent and reproducible work is a check against bias and errors. Work that cannot be verified cannot be challenged.
33.2 Version Control: What It Is and Why You Need It
Version control is a system that records changes to files over time, allowing you to recall specific versions later. It is like "Track Changes" in a word processor, but far more powerful — it works with any type of file, tracks changes across an entire project, and supports multiple people working simultaneously.
The Problem Version Control Solves
Without version control, you end up with something like this:
analysis_v1.ipynb
analysis_v2.ipynb
analysis_v2_final.ipynb
analysis_v2_final_ACTUAL_FINAL.ipynb
analysis_v2_final_ACTUAL_FINAL_fixed.ipynb
analysis_v2_final_ACTUAL_FINAL_fixed_v3.ipynb
This is not a joke. This is how most people manage files before they learn version control. And it is a disaster:
- Which file is the current version? (Is "fixed_v3" newer than "ACTUAL_FINAL"?)
- What changed between versions? (You would have to open both files and compare them manually)
- Can you go back to a specific version? (Maybe, if you can figure out which file it is)
- Can two people work on the file simultaneously? (No — one will overwrite the other)
Version control solves all of these problems by maintaining a complete history of every change, with metadata about who made the change, when, and why.
Enter Git
Git is the most widely used version control system in the world. It was created in 2005 by Linus Torvalds (who also created Linux) and is now the de facto standard for software development and increasingly for data science.
Git tracks changes to files in a repository (or "repo") — a project directory that git is monitoring. Every time you save a meaningful state of your project, you create a commit — a snapshot of all the files at that point in time. Each commit has a unique identifier, a timestamp, an author, and a message describing what changed.
Think of commits as save points in a video game. At any time, you can go back to any previous save point. You can compare the current state with any past state. You can see who changed what and when.
Let's learn by doing.
33.3 Getting Started with Git: The Core Workflow
Installing Git
Git may already be installed on your system. Open a terminal (or command prompt) and type:
git --version
If you see a version number (e.g., git version 2.39.2), you are ready. If not, install git from the official website (git-scm.com) or through your package manager.
You will also need to configure your identity (git uses this for commit metadata):
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
Creating a Repository
Let's create a repository for our vaccination analysis project.
# Navigate to your project directory
cd ~/projects/vaccination-analysis
# Initialize a git repository
git init
This creates a hidden .git directory that stores all of git's internal data. Your project files are unchanged — git is now watching them.
The Three Zones
Git has three "zones" that files can be in:
Working Directory ──git add──→ Staging Area ──git commit──→ Repository
  (your files)               (ready to commit)          (committed history)
- Working directory: The files on your disk. This is where you edit, add, and delete files as you work.
- Staging area: A holding zone for changes you want to include in your next commit. You explicitly choose which changes to stage.
- Repository: The committed history. Once a change is committed, it is permanently recorded.
The Core Workflow
Here is the daily workflow you will use:
Step 1: Check the status
git status
This shows you which files have changed, which are staged, and which are untracked (new files git doesn't know about yet). Get in the habit of running git status frequently — it is your dashboard.
Step 2: Add files to the staging area
# Stage a specific file
git add analysis.ipynb
# Stage all changed files
git add .
# Stage specific files
git add data/clean_vaccination_data.csv analysis.ipynb
Step 3: Commit the staged changes
git commit -m "Add initial vaccination rate analysis with rural/urban comparison"
The -m flag lets you include a commit message inline. The message should explain what you changed and why.
Step 4: View the history
# See the commit log
git log
# See a compact log
git log --oneline
# See what changed in the last commit (previous commit vs. current commit)
git diff HEAD~1 HEAD
Seeing What Changed
One of git's most valuable features is the ability to see exactly what changed between versions.
# See unstaged changes (what you've modified but haven't staged)
git diff
# See staged changes (what will be in the next commit)
git diff --staged
# See changes between two commits
git diff abc123 def456
# See what changed in a specific file
git diff analysis.ipynb
The diff output shows added lines with a + prefix and removed lines with a - prefix. This is incredibly useful for reviewing your own work ("wait, did I change that line?") and for understanding what others changed.
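The same +/- format can be produced inside Python with the standard-library `difflib` module, which is handy when you want to compare two versions of text in a script rather than at the command line. A minimal sketch, not a substitute for `git diff`:

```python
import difflib

# Two versions of the same line of code, as lists of lines
old = ["rate = vaccinated / total_population * 100\n"]
new = ["rate = vaccinated / eligible_population * 100\n"]

# unified_diff yields lines in the same +/- style that git diff uses
for line in difflib.unified_diff(old, new, fromfile="before", tofile="after"):
    print(line, end="")
```

The removed line appears with a - prefix and the added line with a +, just as in git's output.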
33.4 Writing Good Commit Messages
A commit message is a gift to your future self. When you look at the commit history six months from now, the messages are the only clue you will have about what you were thinking and why you made each change.
Bad Commit Messages
"update"
"fix"
"stuff"
"more changes"
"asdfasdf"
"final version"
"final final version"
These messages tell you nothing. You might as well not have written them.
Good Commit Messages
"Clean vaccination data: remove rows with missing country codes and standardize date format"
"Add rural vs urban comparison chart with annotation for pandemic period"
"Fix bug in vaccination rate calculation: was dividing by total population instead of eligible population"
"Add requirements.txt with pinned library versions"
Commit Message Guidelines
- Start with a verb in the imperative mood: "Add," "Fix," "Remove," "Update," "Refactor." Think of it as completing the sentence "This commit will..."
- Keep the first line under 72 characters. This is a convention that ensures the message displays properly in most tools.
- Explain the "why," not just the "what." "Fix vaccination rate calculation" is fine, but "Fix vaccination rate calculation: was dividing by total population instead of eligible population" is much better — it explains what was wrong.
- Use the body for details when needed. Leave a blank line after the first line and add as much context as necessary.
git commit -m "Fix rural vaccination rate calculation

The previous calculation divided by total county population, but
vaccination rates should use eligible population (children under 5).
This affected rural counties disproportionately because they have
lower proportions of children. Rates change by 2-5 percentage points
for most rural counties."
How Often Should You Commit?
Commit whenever you complete a meaningful unit of work. Not after every single keystroke, but not after a week of changes either. Good commit points:
- You finished cleaning a dataset
- You added a new analysis or visualization
- You fixed a bug
- You refactored code for clarity
- You added documentation
A good rule of thumb: if you would be frustrated to lose the work, commit it.
33.5 Branching: Working in Parallel
So far, we have been working on a single line of history. But what if you want to try something experimental without risking your working code? What if two people want to work on different features simultaneously?
This is what branches are for.
What Is a Branch?
A branch is an independent line of development. When you create a branch, you get a copy of your project that you can modify without affecting the main line. When you are done, you can merge the branch back into the main line — or discard it if the experiment did not work out.
main:     A ── B ── C ── D ── E ── F (merge)
                     \            /
feature:              C'── D'── E'
In this diagram, someone branched off at commit C, made several changes (C', D', E'), and then merged those changes back into the main line at commit F.
Creating and Switching Branches
# Create a new branch
git branch feature-rural-analysis
# Switch to the new branch
git checkout feature-rural-analysis
# Shortcut: create AND switch in one command
git checkout -b feature-rural-analysis
# Newer git versions (2.23+) also provide an equivalent: git switch -c feature-rural-analysis
Now any commits you make will be recorded on the feature-rural-analysis branch, leaving the main branch unchanged.
Merging Branches
When your work on a branch is complete, you merge it back into the main branch:
# Switch back to main
git checkout main
# Merge the feature branch into main
git merge feature-rural-analysis
If the same lines of the same file were modified on both branches, git will flag a merge conflict that you need to resolve manually. This sounds intimidating, but it is usually straightforward — git shows you both versions, and you choose which one to keep (or combine them).
When to Use Branches
- Experimental work. Want to try a different modeling approach? Branch. If it works, merge. If not, delete the branch.
- Features. Working on a new analysis while the existing one needs to stay stable? Branch.
- Collaboration. Each team member works on their own branch, then merges when ready.
- Bug fixes. Fix a bug on a branch so you can test it before it affects the main code.
33.6 Remote Repositories and Collaboration
So far, everything has been local — on your computer. To collaborate with others, you need a remote repository — a copy of the repository hosted on a server that everyone can access.
GitHub, GitLab, and Bitbucket
The most popular platforms for hosting git repositories are GitHub, GitLab, and Bitbucket. They provide:
- A remote location for your repository
- Web-based interfaces for viewing code, history, and diffs
- Collaboration features: pull requests, code review, issue tracking
- Access control: who can read, who can write
For this course, we will use GitHub, but the concepts apply to all platforms.
Connecting to a Remote
# Add a remote repository (usually called "origin")
git remote add origin https://github.com/yourusername/vaccination-analysis.git
# Push your local commits to the remote
git push -u origin main
# Pull changes from the remote (get other people's work)
git pull origin main
Pull Requests: The Collaboration Workflow
In a team setting, the standard workflow is:
- Create a branch for your work
- Make your changes and commit them
- Push the branch to the remote repository
- Open a pull request (PR) — a request to merge your branch into the main branch
- Team members review your code, suggest changes, and approve
- The branch is merged into main
Pull requests are the heart of collaborative data science. They provide:
- Visibility: Everyone can see what changes are proposed
- Review: Team members can examine the code before it is merged
- Discussion: Comments and questions can be attached to specific lines of code
- Quality control: Bad code, bugs, and mistakes can be caught before they enter the main codebase
Code Review: The Most Underrated Practice
Code review is the practice of having someone else read your code before it is merged. It is the single most effective quality control practice in software development, and it is equally valuable in data science.
What code reviewers look for:
- Correctness: Does the code do what it claims to do? Are there bugs?
- Clarity: Can someone else understand the code? Are variable names meaningful?
- Methodology: Are the statistical methods appropriate? Are assumptions valid?
- Reproducibility: Are random seeds set? Are dependencies documented?
- Ethics: Could this analysis harm anyone? Are there representation gaps?
Code review is not about criticism — it is about collective ownership. When two people have reviewed a piece of code, both are responsible for its quality. This creates a culture of shared standards.
33.7 Virtual Environments: Capturing Your Software Stack
Remember Alex from the beginning of this chapter? His notebook broke because the library versions changed. Virtual environments prevent this.
The Problem
Your computer has Python and dozens of libraries installed. When you install a new library or update an existing one, it affects every project on your machine. Project A might need pandas 1.5, but Project B might need pandas 2.0. Without isolation, you cannot have both.
The Solution: Virtual Environments
A virtual environment is an isolated Python installation with its own set of packages. Each project gets its own environment with its own library versions. Updating a library in one environment does not affect any other environment.
Creating Environments with conda
If you are using Anaconda or Miniconda (which we recommended in Chapter 2), conda is the tool for environment management:
# Create a new environment with a specific Python version
conda create --name vaccination-project python=3.11
# Activate the environment
conda activate vaccination-project
# Install packages
conda install pandas numpy matplotlib seaborn scipy jupyter
# See what's installed
conda list
# Deactivate when done
conda deactivate
Creating Environments with venv
If you prefer the standard Python approach:
# Create a virtual environment
python -m venv venv
# Activate it (macOS/Linux)
source venv/bin/activate
# Activate it (Windows)
venv\Scripts\activate
# Install packages
pip install pandas numpy matplotlib seaborn scipy jupyter
# Deactivate
deactivate
Saving Your Environment
Once your project is working, save the exact library versions so others can recreate the environment:
With pip:
pip freeze > requirements.txt
This creates a file like:
matplotlib==3.8.2
numpy==1.26.3
pandas==2.1.4
scipy==1.12.0
seaborn==0.13.1
With conda:
conda env export > environment.yml
This creates a YAML file with all packages and versions.
Recreating an Environment
When someone else (or future you) wants to work on the project:
With pip:
pip install -r requirements.txt
With conda:
conda env create -f environment.yml
This installs the exact same versions of every library, ensuring that the code runs the same way it did when you wrote it.
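If you want to verify from within Python that the active environment matches a set of pins, the standard-library `importlib.metadata` module can report installed versions. This is a sketch with illustrative, hard-coded pins; in practice you would parse them from your requirements.txt:

```python
from importlib.metadata import version, PackageNotFoundError

# Illustrative pins -- in a real project, read these from requirements.txt
pinned = {"numpy": "1.26.3", "pandas": "2.1.4"}

def check_environment(pins):
    """Compare installed package versions against pinned versions.

    Returns a dict of {package: (expected, installed)} for any mismatch;
    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches

for pkg, (want, have) in check_environment(pinned).items():
    print(f"{pkg}: expected {want}, found {have}")
```

Running a check like this at the top of a long analysis can catch a subtly wrong environment before it produces subtly wrong numbers.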
Which Files to Include in Your Repository
vaccination-analysis/
├── .gitignore # Files git should ignore
├── README.md # Project description and setup instructions
├── requirements.txt # Library dependencies (pip)
├── environment.yml # Library dependencies (conda)
├── data/
│ ├── raw/ # Original, unmodified data files
│ └── processed/ # Cleaned, transformed data
├── notebooks/
│ ├── 01-data-cleaning.ipynb
│ ├── 02-exploration.ipynb
│ └── 03-analysis.ipynb
├── src/ # Reusable Python scripts/modules
│ └── data_utils.py
└── results/
├── figures/
└── reports/
33.8 The .gitignore File: What NOT to Track
Not everything in your project directory should be tracked by git. Large data files, temporary files, environment directories, and sensitive information should be excluded.
Create a file called .gitignore in the root of your repository:
# Python
__pycache__/
*.pyc
*.pyo
# Jupyter
.ipynb_checkpoints/
# Virtual environments
venv/
.conda/
# Data (too large for git — store elsewhere)
data/raw/*.csv
data/raw/*.xlsx
*.h5
*.parquet
# OS files
.DS_Store
Thumbs.db
# Secrets (NEVER commit these)
.env
credentials.json
*.key
# IDE settings
.vscode/
.idea/
The .gitignore tells git to pretend these files do not exist. They will not be staged, committed, or pushed, even if you run `git add .`.
What About Large Data Files?
Git is designed for code (text files), not large data files. If your dataset is more than a few megabytes, do not put it in git. Instead:
- Store it in a shared cloud location (Google Drive, S3, Azure Blob)
- Document the data source and download instructions in your README
- Include a small sample of the data in the repository for testing
- Consider git-lfs (Large File Storage) if you need to version large files
33.9 Setting Random Seeds: The Easiest Reproducibility Win
Many data science operations involve randomness: splitting data into train and test sets, initializing model weights, bootstrapping confidence intervals, k-fold cross-validation. Without controlling the randomness, you will get different results every time.
The solution is simple: set a random seed.
import numpy as np
import random
# Set seeds at the beginning of your notebook
np.random.seed(42)
random.seed(42)
# Now random operations are reproducible
# This will produce the same split every time:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
The number 42 is conventional (a reference to The Hitchhiker's Guide to the Galaxy), but any integer works. What matters is that you set it and document it.
Where to set seeds:
- At the top of every notebook or script
- In every function that uses randomness
- As the `random_state` parameter in scikit-learn functions
What to document:
- The seed value you used
- Any library-specific seeding (some libraries have their own random number generators)
This is the single easiest thing you can do for reproducibility, and it takes approximately five seconds.
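Note that NumPy also provides a newer seeding style: rather than setting the global state with `np.random.seed`, you can create a local, seeded `Generator`, which keeps the randomness scoped to your own code instead of affecting every library that shares the global state. A small sketch of the pattern:

```python
import numpy as np

# A seeded Generator is an independent, reproducible stream of randomness
rng = np.random.default_rng(42)
sample = rng.integers(0, 100, size=5)

# Re-creating the generator with the same seed reproduces the same draws
rng2 = np.random.default_rng(42)
assert (sample == rng2.integers(0, 100, size=5)).all()
print(sample)
```

Either style works for reproducibility; what matters is that the seed is explicit and recorded.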
33.10 Writing a README: Your Project's Front Door
A README is a document that introduces your project to anyone encountering it for the first time. It is typically the first file someone reads when they open your repository, and it is the most important piece of documentation you will write.
What a Good README Contains
# Vaccination Rate Analysis: Rural vs. Urban Trends
## Overview
This project analyzes county-level childhood vaccination data
(2015-2023) to investigate diverging trends between rural and urban
areas, with a focus on the impact of community health clinics
during the COVID-19 disruption period.
## Key Findings
- Rural vaccination rates declined by 11 percentage points between
2019 and 2022, compared to 3 points in urban areas
- Counties with community health clinics maintained significantly
higher rates during the pandemic
- The estimated cost per additional vaccination through clinics ($47)
is less than half the cost of awareness campaigns ($112)
## Setup
### Prerequisites
- Python 3.11+
- conda or pip
### Installation
# Clone the repository
git clone https://github.com/username/vaccination-analysis.git
cd vaccination-analysis
# Create the environment
conda env create -f environment.yml
conda activate vaccination-project
# Or, using pip:
pip install -r requirements.txt
### Data
Download the CDC county-level immunization data from [source URL].
Place the file in `data/raw/`.
## Project Structure
vaccination-analysis/
├── README.md
├── requirements.txt
├── data/ # Not tracked by git
├── notebooks/
│ ├── 01-data-cleaning.ipynb
│ ├── 02-exploration.ipynb
│ └── 03-analysis.ipynb
├── src/
│ └── data_utils.py
└── results/
└── figures/
## Usage
Run the notebooks in numerical order:
1. `01-data-cleaning.ipynb` — loads raw data, cleans, and saves
processed version
2. `02-exploration.ipynb` — exploratory analysis and initial
visualizations
3. `03-analysis.ipynb` — main analysis and figure generation
## Contributing
See [CONTRIBUTING.md] for guidelines on how to contribute.
## License
This project is licensed under the MIT License.
## Authors
- Your Name (@username)
README Essentials
At a minimum, your README should answer:
- What is this project? (one paragraph)
- How do I set it up? (installation steps)
- How do I run it? (usage instructions)
- What data does it need? (data sources and setup)
- What does the directory structure look like? (navigation guide)
Beyond these essentials, consider adding:
- Key findings or results
- Known limitations
- Contact information or contribution guidelines
- License information
33.11 Project Milestone: Setting Up Your Project Repository
Let's apply everything from this chapter to the vaccination analysis project you have been building throughout the book.
Step 1: Initialize the Repository
cd ~/projects/vaccination-analysis
git init
Step 2: Create the .gitignore
Create a .gitignore file that excludes data files, environment directories, and Jupyter checkpoints.
Step 3: Create the Directory Structure
Organize your project files into a clear structure:
vaccination-analysis/
├── .gitignore
├── README.md
├── requirements.txt
├── notebooks/
│ ├── 01-data-cleaning.ipynb
│ ├── 02-exploration.ipynb
│ └── 03-analysis.ipynb
├── src/
└── results/
└── figures/
Step 4: Write the requirements.txt
Activate your environment and freeze your dependencies:
pip freeze > requirements.txt
Or write it manually with the libraries you actually use:
pandas>=2.0
numpy>=1.24
matplotlib>=3.7
seaborn>=0.12
scipy>=1.10
jupyter>=1.0
Step 5: Write the README
Using the template from Section 33.10, write a README that describes your vaccination analysis project. Include:
- What the project analyzes
- How to set up the environment
- How to obtain the data
- How to run the analysis
- What results to expect
Step 6: Make Your First Commit
git add .gitignore README.md requirements.txt
git add notebooks/ src/ results/
git status # Review what you're about to commit
git commit -m "Initialize vaccination analysis project with structure, README, and dependencies"
Step 7: Set Up a Remote (Optional)
If you have a GitHub account:
# Create a repository on GitHub first (via the web interface)
git remote add origin https://github.com/yourusername/vaccination-analysis.git
git push -u origin main
Your project is now version-controlled, documented, and shareable. Anyone who finds your repository can understand what it does, set up the environment, and run the analysis. Your future self, six months from now, will be grateful.
33.12 Collaboration in Practice: Working with a Team
Individual version control is valuable. Team version control is transformative. Here are the practices that make team data science work.
The Feature Branch Workflow
In a team, nobody works directly on the main branch. Instead:
- Each person creates a branch for their work (`feature/rural-analysis`, `fix/date-parsing`, `update/readme`)
- They make changes and commit on their branch
- They push the branch to the remote and open a pull request
- Another team member reviews the code
- After approval, the branch is merged into `main`
This workflow ensures that main always contains working, reviewed code. No one's experimental work can break the shared codebase.
Dealing with Conflicts
When two people modify the same file, git may produce a merge conflict. Git marks the conflicting sections:
<<<<<<< HEAD
# Your version
vaccination_rate = vaccinated / eligible_population * 100
=======
# Their version
vaccination_rate = vaccinated / total_population * 100
>>>>>>> feature/fix-rate-calc
To resolve the conflict:
- Open the file and find the conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`)
- Decide which version is correct (or combine them)
- Remove the conflict markers
- Save, stage, and commit
Conflicts are a normal part of collaboration. They are not errors — they are signals that two people touched the same code, and a human decision is needed.
Communication Norms
Technical tools are not enough. Teams also need communication practices:
- Agree on a branching strategy. Does each person work on one branch, or does each feature get a branch?
- Write descriptive PR descriptions. Explain what changed and why, not just "updated code."
- Review promptly. A pull request that sits for a week slows everyone down.
- Use issues for task tracking. GitHub Issues (or similar tools) let teams track what needs to be done, what is in progress, and what is complete.
- Document conventions. How do you name branches? How do you format commit messages? Put these decisions in a CONTRIBUTING.md file.
33.13 Putting It All Together: The Reproducibility Checklist
Before you share any analysis — whether for a class, a team, or the world — run through this checklist:
Version Control:
- [ ] The project is in a git repository
- [ ] All changes are committed with descriptive messages
- [ ] The .gitignore excludes data files, environments, and secrets
Environment:
- [ ] Dependencies are recorded in requirements.txt or environment.yml
- [ ] Library versions are pinned (exact versions, not just package names)
- [ ] The environment can be recreated from scratch on a clean machine
Data:
- [ ] The data source is documented (URL, download instructions, date accessed)
- [ ] Raw data is never modified — processing steps create new files
- [ ] If data is too large for git, download instructions are in the README
Code:
- [ ] Random seeds are set for all stochastic operations
- [ ] The analysis runs top-to-bottom without manual intervention
- [ ] File paths use relative paths (not absolute paths like /Users/alex/data/)
Documentation:
- [ ] The README explains what the project does, how to set it up, and how to run it
- [ ] Key decisions and assumptions are documented (in the notebook or README)
- [ ] Results are clearly labeled and connected to the code that produced them
If you can check every box, your analysis is reproducible. Someone who has never seen your project can clone the repository, create the environment, run the code, and get the same results. That is the standard to aim for.
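The relative-paths item on the checklist is easy to satisfy with Python's `pathlib`: anchor every path to the project root instead of hard-coding a machine-specific location. A sketch, using the directory names from the project layout shown earlier in this chapter:

```python
from pathlib import Path

# Resolve the project root relative to this file, not to whoever runs it.
try:
    PROJECT_ROOT = Path(__file__).resolve().parent
except NameError:
    # __file__ is undefined inside notebooks; fall back to the working directory
    PROJECT_ROOT = Path.cwd()

RAW_DATA = PROJECT_ROOT / "data" / "raw"
FIGURES = PROJECT_ROOT / "results" / "figures"

# These paths work on any machine and any OS, unlike "/Users/alex/data/"
print(RAW_DATA)
```

Because the paths are built from the project root at run time, cloning the repository to a different machine requires no edits to the code.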
Chapter Summary
This chapter covered the tools and practices that make data science reproducible and collaborative. These are not theoretical concepts — they are daily practices that separate professional data science from notebook-on-a-laptop data science.
Version control with git tracks every change to every file, lets you go back to any previous version, and enables team collaboration without chaos. The core workflow — add, commit, push, pull — becomes second nature with practice.
Branching and pull requests enable parallel work and code review. Nobody works directly on main; everyone's work is reviewed before it is merged.
Virtual environments isolate project dependencies so that each project has its own library versions. requirements.txt and environment.yml let others recreate your environment exactly.
Documentation — the README, commit messages, and inline comments — is communication with your future self and your collaborators. Write as if the reader has never seen your project before, because they have not.
Random seeds are the simplest reproducibility practice: set them at the top of every notebook.
The reproducibility checklist is your pre-flight check before sharing any analysis.
These skills may not feel as exciting as building a machine learning model or creating a stunning visualization. But they are the skills that make the exciting work last — that turn a one-time analysis into a living project that others can build on, verify, and trust.
You are ready for Chapter 34, where you will learn how to build a portfolio that showcases your data science skills to potential employers and collaborators. Everything you have learned in this chapter — git repositories, READMEs, reproducible code — is the foundation of a professional portfolio.