Key Takeaways: Reproducibility and Collaboration: Git, Environments, and Working with Teams
This is your reference card for Chapter 33. Keep it handy as you start new projects — the practices here should become second nature.
The Threshold Concept
Code without version control and documentation is a liability, not an asset.
Reproducibility and collaboration require deliberate tooling and practice. The tools feel like overhead at first, but they save far more time than they cost — because you stop losing work, stop breaking code, and stop spending days trying to figure out what you did three months ago.
Git Core Workflow
# Check what has changed
git status
# Stage changes for commit
git add filename.py # specific file
git add . # all changes under the current directory
# Commit with a descriptive message
git commit -m "Add rural vaccination analysis with pandemic comparison"
# View history
git log --oneline
# See what changed
git diff # unstaged changes
git diff --staged # staged changes
Commit Message Guidelines
| Do | Don't |
|---|---|
| Start with a verb: Add, Fix, Update, Remove | Write "update" or "changes" |
| Keep the first line under 72 characters | Write a paragraph on one line |
| Explain why, not just what | Write "fixed stuff" |
| Reference related issues if applicable | Use commit messages as a diary |
Good examples:
Add rural vs urban vaccination comparison chart
Fix rate calculation: divide by eligible population, not total
Remove deprecated pandas .append() calls, use pd.concat() instead
Update requirements.txt to pin scipy version
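The guidelines in the table can be partly checked by machine. The following is a minimal sketch, not a standard tool: `check_commit_subject` and `APPROVED_VERBS` are hypothetical names, the verb list is only the four verbs named in the table, and the three-word minimum is an assumed stand-in for "not vague".

```python
# Hypothetical helper: checks the first line of a commit message
# against the guidelines in the table above.
APPROVED_VERBS = ("Add", "Fix", "Update", "Remove")  # verbs named in the table
MAX_SUBJECT_LENGTH = 72  # the first line should stay under this many characters
MIN_WORDS = 3  # assumption: fewer words than this is treated as too vague


def check_commit_subject(message: str) -> list[str]:
    """Return a list of guideline violations (empty if the subject looks fine)."""
    lines = message.splitlines()
    subject = lines[0].strip() if lines else ""
    if not subject:
        return ["subject line is empty"]

    problems = []
    words = subject.split()
    if words[0] not in APPROVED_VERBS:
        problems.append("does not start with one of: " + ", ".join(APPROVED_VERBS))
    if len(subject) >= MAX_SUBJECT_LENGTH:
        problems.append(f"first line has {len(subject)} characters")
    if len(words) < MIN_WORDS:
        problems.append("too vague: say what changed and why")
    return problems
```

For example, `check_commit_subject("fixed stuff")` reports two problems, while each of the good examples above returns an empty list. A real team would likely extend the verb list to match its own conventions.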
Branching Quick Reference
# Create and switch to a new branch
git checkout -b feature/rural-analysis
# Switch between branches
git checkout main
git checkout feature/rural-analysis
# Merge a branch into main
git checkout main
git merge feature/rural-analysis
# Delete a merged branch
git branch -d feature/rural-analysis
When to branch:
- New feature or analysis
- Experimental work
- Bug fixes
- Each team member's parallel work
Remote Repository Commands
# Connect to a remote
git remote add origin https://github.com/user/repo.git
# Push changes to remote
git push -u origin main
# Pull changes from remote
git pull origin main
# Push a branch
git push -u origin feature/my-branch
Virtual Environment Quick Reference
With conda:
conda create --name project python=3.11
conda activate project
conda install pandas numpy matplotlib
conda deactivate
With venv:
python -m venv venv
source venv/bin/activate # macOS/Linux
venv\Scripts\activate # Windows
pip install pandas numpy matplotlib
deactivate
Save and recreate:
# Save
pip freeze > requirements.txt
# Recreate
pip install -r requirements.txt
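A requirements file only makes an environment reproducible if every entry names an exact version. As a rough sketch, the hypothetical helper below (`unpinned_requirements` is not part of pip) lists entries that lack an `==` pin. It assumes simple `name==version` lines; it skips pip options such as `-r` or `-e` and will mishandle URL requirements that contain `#`.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt" or "-e .".
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Illustrative input only; the version numbers are placeholders.
sample = """\
pandas==2.1.4
numpy>=1.26
matplotlib
scipy==1.11.4  # pinned
"""
print(unpinned_requirements(sample))
```

To check a real project, pass in the file contents, for instance `unpinned_requirements(open("requirements.txt").read())`. Output from `pip freeze` is normally pinned already, so this matters most for hand-edited files.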
Essential .gitignore Entries
# Python
__pycache__/
*.pyc
# Jupyter
.ipynb_checkpoints/
# Virtual environments
venv/
.conda/
# Data (too large for git)
data/raw/
*.csv
*.parquet
# Secrets (NEVER commit)
.env
credentials.json
# OS files
.DS_Store
Thumbs.db
Random Seeds
Set at the top of every notebook:
import numpy as np
import random
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
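Pass random_state=RANDOM_SEED to every scikit-learn function or estimator that accepts it:
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
RandomForestClassifier(n_estimators=100, random_state=RANDOM_SEED)
KFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)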
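To see what a fixed seed buys you, the sketch below draws the same sample twice. It uses only the standard library so it runs anywhere; `draw_sample` is a hypothetical helper, and the same idea applies to NumPy and scikit-learn generators.

```python
import random

RANDOM_SEED = 42


def draw_sample(seed: int, size: int = 5) -> list[float]:
    """Draw `size` pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator; the global state is untouched
    return [rng.random() for _ in range(size)]


first_run = draw_sample(RANDOM_SEED)
second_run = draw_sample(RANDOM_SEED)
other_seed = draw_sample(RANDOM_SEED + 1)

print(first_run == second_run)  # same seed, so the two runs are compared as equal lists
print(first_run == other_seed)  # a different seed gives a different sequence
```

Using a local generator object rather than the global seed is a design choice: it keeps one part of an analysis from silently changing the random numbers another part sees.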
README Template
# Project Title
## Overview
What this project does (1-2 sentences)
## Key Findings
- Finding 1
- Finding 2
- Finding 3
## Setup
### Prerequisites
- Python 3.11+
- conda or pip
### Installation
git clone <url>
pip install -r requirements.txt
### Data
Where to get the data and where to put it
## Usage
How to run the analysis (in what order)
## Project Structure
Directory tree showing key files
## Authors
Who worked on this
Project Structure Template
project-name/
├── .gitignore
├── README.md
├── requirements.txt
├── data/
│ ├── raw/ # Original data (not in git)
│ └── processed/ # Cleaned data (not in git)
├── notebooks/
│ ├── 01-cleaning.ipynb
│ ├── 02-exploration.ipynb
│ └── 03-analysis.ipynb
├── src/ # Reusable code
└── results/
├── figures/
└── reports/
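If you set up projects often, the layout above can be scripted. This is one possible sketch using `pathlib`; `create_skeleton`, `PROJECT_DIRS`, and `PROJECT_FILES` are hypothetical names, and the files are created empty for you to fill in.

```python
from pathlib import Path

# Directories and top-level files taken from the template above.
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "results/figures",
    "results/reports",
]
PROJECT_FILES = [".gitignore", "README.md", "requirements.txt"]


def create_skeleton(root: Path) -> None:
    """Create the template directory tree and empty top-level files under `root`."""
    for relative_dir in PROJECT_DIRS:
        # parents=True also creates `root` itself if it does not exist yet.
        (root / relative_dir).mkdir(parents=True, exist_ok=True)
    for filename in PROJECT_FILES:
        # touch() leaves an existing file's contents alone.
        (root / filename).touch(exist_ok=True)
```

Calling `create_skeleton(Path("project-name"))` builds the tree in the current directory. Running it again is safe because existing directories and files are left as they are. Git does not track empty directories, so `src/` and `results/` will only appear in the repository once they contain files.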
The Reproducibility Checklist
Before sharing any analysis:
Version Control:
- [ ] Project is in a git repository
- [ ] All changes are committed with descriptive messages
- [ ] .gitignore excludes data, environments, and secrets

Environment:
- [ ] Dependencies in requirements.txt or environment.yml
- [ ] Versions are pinned
- [ ] Environment can be recreated on a clean machine

Data:
- [ ] Source is documented
- [ ] Raw data is never modified
- [ ] Download instructions in README

Code:
- [ ] Random seeds set everywhere
- [ ] Analysis runs top-to-bottom
- [ ] File paths are relative

Documentation:
- [ ] README explains setup and usage
- [ ] Key decisions are documented
- [ ] Results are connected to code
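The "file paths are relative" item is the one that most often breaks on a teammate's machine. One way to satisfy it, sketched below, is to locate the project root at run time and build every path from there. `find_project_root` is a hypothetical helper, and using `README.md` as the marker file is an assumption; any file that lives only at the project root would work.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "README.md") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found in {start} or any parent directory")
```

In a notebook under `notebooks/`, you might write `PROJECT_ROOT = find_project_root(Path.cwd())` and then `RAW_DATA = PROJECT_ROOT / "data" / "raw"`. No path then contains a user name or drive letter, so the same code runs wherever the repository is cloned.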
Team Collaboration Workflow
1. Create a branch → git checkout -b feature/my-work
2. Do your work → edit, add, commit (repeat)
3. Push the branch → git push -u origin feature/my-work
4. Open a PR → on GitHub, with description
5. Code review → teammate reviews and approves
6. Merge → merge PR into main
7. Clean up → delete the branch
What You Should Be Able to Do Now
- [ ] Explain why reproducibility matters for science and practice
- [ ] Initialize a git repository and perform add/commit/push
- [ ] Write clear, informative commit messages
- [ ] Create branches, merge them, and resolve conflicts
- [ ] Create virtual environments with conda or venv
- [ ] Generate requirements.txt for reproducible environments
- [ ] Write a README that lets others set up and run your project
- [ ] Set random seeds for all stochastic operations
- [ ] Use .gitignore to exclude data, environments, and secrets
- [ ] Participate in code review via pull requests
The Three-Minute Setup
When starting ANY new project, do these three things immediately:
# 1. Initialize git
git init
# printf expands \n portably; bash's built-in echo would write it literally
printf '__pycache__/\n.ipynb_checkpoints/\nvenv/\n.env\n' > .gitignore
git add .gitignore
git commit -m "Initialize repository with .gitignore"
# 2. Create environment and pin dependencies
python -m venv venv && source venv/bin/activate
pip install pandas numpy matplotlib seaborn jupyter
pip freeze > requirements.txt
git add requirements.txt
git commit -m "Add requirements.txt with initial dependencies"
# 3. Write a minimal README
printf '# Project Title\n\nOne-sentence description.\n' > README.md
git add README.md
git commit -m "Add README"
Three minutes. Three commits. Your project is now reproducible and shareable. Everything else builds on this foundation.
You are ready for Chapter 34, where you will learn to build a portfolio that showcases your data science skills. The version-controlled, well-documented projects you create using the practices from this chapter ARE your portfolio — each one is a demonstration of professional competence.