Key Takeaways: Reproducibility and Collaboration: Git, Environments, and Working with Teams

Contributors to Introduction to Data Science

This is your reference card for Chapter 33. Keep it handy as you start new projects — the practices here should become second nature.

The Threshold Concept

Code without version control and documentation is a liability, not an asset.

Reproducibility and collaboration require deliberate tooling and practice. The tools feel like overhead at first, but they save far more time than they cost — because you stop losing work, stop breaking code, and stop spending days trying to figure out what you did three months ago.

Git Core Workflow

# Check what has changed
git status

# Stage changes for commit
git add filename.py          # specific file
git add .                    # all changes

# Commit with a descriptive message
git commit -m "Add rural vaccination analysis with pandemic comparison"

# View history
git log --oneline

# See what changed
git diff                     # unstaged changes
git diff --staged            # staged changes

Commit Message Guidelines

Do                                             Don't
Start with a verb: Add, Fix, Update, Remove    Write "update" or "changes"
Keep the first line under 72 characters        Write a paragraph on one line
Explain why, not just what                     Write "fixed stuff"
Reference related issues if applicable         Use commit messages as a diary

Good examples:

Add rural vs urban vaccination comparison chart
Fix rate calculation: divide by eligible population, not total
Remove deprecated pandas .append() calls, use pd.concat() instead
Update requirements.txt to pin scipy version
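
The guidelines above can be partly checked by machine. The following is a minimal sketch, not an established tool: check_commit_subject and its thresholds are hypothetical names chosen for illustration. It only tests the mechanical rules (leading verb, length, a minimum word count) and cannot judge whether a message explains why.

```python
# Hypothetical helper: checks a commit subject line against the guidelines above.
APPROVED_VERBS = ("Add", "Fix", "Update", "Remove")  # the verbs listed in the table
MAX_SUBJECT_LENGTH = 72                               # first line must stay under this


def check_commit_subject(message: str) -> list[str]:
    """Return a list of guideline violations (empty if none were found)."""
    lines = message.splitlines()
    subject = lines[0].strip() if lines else ""
    if not subject:
        return ["subject line is empty"]

    problems = []
    words = subject.split()
    first_word = words[0].rstrip(":")
    if first_word not in APPROVED_VERBS:
        problems.append(f"starts with {first_word!r}, not one of {APPROVED_VERBS}")
    if len(words) < 3:
        # Assumption: fewer than three words is too vague ("fixed stuff", "update").
        problems.append("too short to be descriptive")
    if len(subject) >= MAX_SUBJECT_LENGTH:
        problems.append(
            f"subject is {len(subject)} characters; keep it under {MAX_SUBJECT_LENGTH}"
        )
    return problems


if __name__ == "__main__":
    for candidate in (
        "Fix rate calculation: divide by eligible population, not total",
        "fixed stuff",
    ):
        print(candidate, "->", check_commit_subject(candidate) or "OK")
```

A team may well accept more verbs than the four listed here; extend APPROVED_VERBS to match whatever convention you agree on.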

Branching Quick Reference

# Create and switch to a new branch
git checkout -b feature/rural-analysis

# Switch between branches
git checkout main
git checkout feature/rural-analysis

# Merge a branch into main
git checkout main
git merge feature/rural-analysis

# Delete a merged branch
git branch -d feature/rural-analysis

When to branch:
- New feature or analysis
- Experimental work
- Bug fixes
- Each team member's parallel work

Remote Repository Commands

# Connect to a remote
git remote add origin https://github.com/user/repo.git

# Push changes to remote
git push -u origin main

# Pull changes from remote
git pull origin main

# Push a branch
git push -u origin feature/my-branch

Virtual Environment Quick Reference

With conda:

conda create --name project python=3.11
conda activate project
conda install pandas numpy matplotlib
conda deactivate

With venv:

python -m venv venv
source venv/bin/activate      # macOS/Linux
venv\Scripts\activate         # Windows
pip install pandas numpy matplotlib
deactivate

Save and recreate:

# Save
pip freeze > requirements.txt

# Recreate
pip install -r requirements.txt
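
A requirements file only reproduces an environment if its versions are pinned. As a rough sketch of how to spot unpinned entries, the hypothetical function find_unpinned below flags any requirement line without an exact `==` pin. It is a simplification: it does not parse the full requirements syntax, so direct URL or local-path entries would also be flagged. The version numbers in the demo are illustrative only.

```python
def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):        # skip blanks and pip options (e.g. -r)
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    # Illustrative contents; in practice, read the text of requirements.txt instead.
    example = "pandas==2.1.0\nnumpy>=1.24\nmatplotlib\n"
    print(find_unpinned(example))
```

Output from pip freeze is pinned by construction, so this check matters mostly for requirements files that were written by hand.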

Essential .gitignore Entries

# Python
__pycache__/
*.pyc

# Jupyter
.ipynb_checkpoints/

# Virtual environments
venv/
.conda/

# Data (too large for git)
data/raw/
*.csv
*.parquet

# Secrets (NEVER commit)
.env
credentials.json

# OS files
.DS_Store
Thumbs.db

Random Seeds

Set at the top of every notebook:

import numpy as np
import random

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

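Use random_state=RANDOM_SEED in every scikit-learn function or estimator that accepts it:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, train_test_split

train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
RandomForestClassifier(n_estimators=100, random_state=RANDOM_SEED)
KFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)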
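
To see why seeding matters, here is a small sketch using only the standard library. The function sample_ids is a hypothetical example, not part of any library. It shows that the same seed always reproduces the same sequence of draws, which is the property that random_state relies on.

```python
import random

RANDOM_SEED = 42


def sample_ids(seed: int, n: int = 5) -> list[int]:
    """Draw n pseudo-random IDs from an independent, seeded generator."""
    rng = random.Random(seed)  # local generator; leaves the global state untouched
    return [rng.randrange(1000) for _ in range(n)]


first_run = sample_ids(RANDOM_SEED)
second_run = sample_ids(RANDOM_SEED)
print(first_run == second_run)  # True: same seed, same sequence

# The module-level generator behaves the same way once it is re-seeded.
random.seed(RANDOM_SEED)
a = random.random()
random.seed(RANDOM_SEED)
b = random.random()
print(a == b)  # True
```

Note that a notebook whose cells are run out of order consumes random numbers in a different order, so seeding helps most when the analysis also runs top-to-bottom.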

README Template

# Project Title

## Overview
What this project does (1-2 sentences)

## Key Findings
- Finding 1
- Finding 2
- Finding 3

## Setup
### Prerequisites
- Python 3.11+
- conda or pip

### Installation
git clone <url>
pip install -r requirements.txt

### Data
Where to get the data and where to put it

## Usage
How to run the analysis (in what order)

## Project Structure
Directory tree showing key files

## Authors
Who worked on this

Project Structure Template

project-name/
├── .gitignore
├── README.md
├── requirements.txt
├── data/
│   ├── raw/          # Original data (not in git)
│   └── processed/    # Cleaned data (not in git)
├── notebooks/
│   ├── 01-cleaning.ipynb
│   ├── 02-exploration.ipynb
│   └── 03-analysis.ipynb
├── src/              # Reusable code
└── results/
    ├── figures/
    └── reports/
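
If you create this layout often, it can be scripted. The sketch below is one possible approach, with scaffold_project as a hypothetical helper name. It creates the directories and empty top-level files from the template above and leaves any existing files alone. The notebooks themselves are not created.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Directories from the template above, relative to the project root.
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "results/figures",
    "results/reports",
]
# Top-level files from the template; created empty if they are missing.
PROJECT_FILES = [".gitignore", "README.md", "requirements.txt"]


def scaffold_project(root: str | Path) -> Path:
    """Create the template skeleton under root and return the root path."""
    root_path = Path(root)
    for relative_dir in PROJECT_DIRS:
        (root_path / relative_dir).mkdir(parents=True, exist_ok=True)
    for file_name in PROJECT_FILES:
        (root_path / file_name).touch(exist_ok=True)  # does not erase existing content
    return root_path


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        project = scaffold_project(Path(tmp) / "project-name")
        for path in sorted(project.rglob("*")):
            print(path.relative_to(project))
```

Running it a second time on the same directory is harmless, because existing directories and files are kept as they are.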

The Reproducibility Checklist

Before sharing any analysis:

Version Control:
- [ ] Project is in a git repository
- [ ] All changes are committed with descriptive messages
- [ ] .gitignore excludes data, environments, and secrets

Environment:
- [ ] Dependencies in requirements.txt or environment.yml
- [ ] Versions are pinned
- [ ] Environment can be recreated on a clean machine

Data:
- [ ] Source is documented
- [ ] Raw data is never modified
- [ ] Download instructions in README

Code:
- [ ] Random seeds set everywhere
- [ ] Analysis runs top-to-bottom
- [ ] File paths are relative

Documentation:
- [ ] README explains setup and usage
- [ ] Key decisions are documented
- [ ] Results are connected to code
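
The "file paths are relative" item is easiest to satisfy if code locates the project root itself rather than hard-coding a path from one machine. The following is a minimal sketch of one way to do that. The helper find_project_root and the choice of README.md as the marker file are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "README.md") -> Path:
    """Walk upward from start until a directory containing marker is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")


if __name__ == "__main__":
    # Build a throwaway project so the example runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project-name"
        (project / "notebooks").mkdir(parents=True)
        (project / "README.md").touch()

        # From inside notebooks/, resolve data paths relative to the root.
        root = find_project_root(project / "notebooks")
        raw_data_dir = root / "data" / "raw"
        print(raw_data_dir.relative_to(root))
```

In a notebook you would typically call find_project_root(Path.cwd()) and build every data and results path from the returned root.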

Team Collaboration Workflow

1. Create a branch    →  git checkout -b feature/my-work
2. Do your work       →  edit, add, commit (repeat)
3. Push the branch    →  git push -u origin feature/my-work
4. Open a PR          →  on GitHub, with description
5. Code review        →  teammate reviews and approves
6. Merge              →  merge PR into main
7. Clean up           →  delete the branch

What You Should Be Able to Do Now

[ ] Explain why reproducibility matters for science and practice
[ ] Initialize a git repository and perform add/commit/push
[ ] Write clear, informative commit messages
[ ] Create branches, merge them, and resolve conflicts
[ ] Create virtual environments with conda or venv
[ ] Generate requirements.txt for reproducible environments
[ ] Write a README that lets others set up and run your project
[ ] Set random seeds for all stochastic operations
[ ] Use .gitignore to exclude data, environments, and secrets
[ ] Participate in code review via pull requests

The Three-Minute Setup

When starting ANY new project, do these three things immediately:

# 1. Initialize git
git init
printf '%s\n' "__pycache__/" ".ipynb_checkpoints/" "venv/" ".env" > .gitignore   # printf is portable; bash's echo does not expand \n
git add .gitignore
git commit -m "Initialize repository with .gitignore"

# 2. Create environment and pin dependencies
python -m venv venv && source venv/bin/activate
pip install pandas numpy matplotlib seaborn jupyter
pip freeze > requirements.txt
git add requirements.txt
git commit -m "Add requirements.txt with initial dependencies"

# 3. Write a minimal README
printf '# Project Title\n\nOne-sentence description.\n' > README.md
git add README.md
git commit -m "Add README"

Three minutes. Three commits. Your project is now reproducible and shareable. Everything else builds on this foundation.

You are ready for Chapter 34, where you will learn to build a portfolio that showcases your data science skills. The version-controlled, well-documented projects you create using the practices from this chapter ARE your portfolio — each one is a demonstration of professional competence.