Chapter 33 Exercises: Reproducibility and Collaboration: Git, Environments, and Working with Teams

How to use these exercises: Part A tests conceptual understanding of version control and reproducibility. Part B is hands-on — you will actually use git, create environments, and write documentation. Part C involves team simulation and workflow design. Part D synthesizes the chapter and extends it to broader tooling: data versioning, Docker, and team workflow design.

Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension

Important: Many of these exercises require you to use the command line. If you get stuck on a git command, git help <command> will show you the documentation, and git status will always tell you where you are.


Part A: Conceptual Understanding ⭐


Exercise 33.1: Why reproducibility matters

For each scenario, explain why reproducibility is critical and what could go wrong without it:

  1. A pharmaceutical company uses a data science model to determine drug dosages.
  2. A student submits a data analysis for a class project, and the professor wants to verify the results.
  3. A data scientist leaves a company, and their replacement needs to update a quarterly report.
  4. A research paper claims that a new treatment reduces hospital readmissions by 15%.
  5. A machine learning model was trained six months ago and now needs to be retrained with new data.
Guidance

1. **Pharmaceutical dosage:** If the model's results cannot be reproduced, the dosage recommendations could be wrong — potentially harming patients. Regulatory agencies (FDA) require reproducible methodology. Without it, the drug cannot be approved.
2. **Class project:** If the professor cannot run the code (missing dependencies, hardcoded file paths, no random seeds), they cannot verify the results. The student's work may look correct but contain errors that only become apparent when someone else tries to run it.
3. **Employee departure:** If the departing data scientist did not document their work, use version control, or record their environment, the replacement may spend weeks reverse-engineering the quarterly report instead of improving it. This is one of the most common and costly reproducibility failures in industry.
4. **Research paper:** If the treatment claim cannot be reproduced by independent researchers, the finding may be wrong — potentially wasting millions of dollars in healthcare spending or, worse, directing patients toward an ineffective treatment.
5. **Model retraining:** If the original training environment is not documented, the retrained model may behave differently — not because of the new data, but because of different library versions, different random seeds, or different preprocessing steps.

Exercise 33.2: Git vocabulary

Match each term with its definition:

| Term | Definition |
|------|------------|
| 1. Repository | A. A request to merge one branch into another, with review |
| 2. Commit | B. A snapshot of your project at a point in time |
| 3. Branch | C. A project directory tracked by git |
| 4. Merge | D. A copy of a repository on a remote server |
| 5. Pull request | E. An independent line of development |
| 6. Remote | F. Combining changes from one branch into another |
| 7. Staging area | G. A holding zone for changes before committing |
| 8. Clone | H. Creating a local copy of a remote repository |
Guidance

1-C, 2-B, 3-E, 4-F, 5-A, 6-D, 7-G, 8-H

- A **repository** (C) is the project folder being tracked by git.
- A **commit** (B) is a recorded snapshot — like a save point.
- A **branch** (E) is a parallel line of development where you can work without affecting the main code.
- A **merge** (F) combines changes from one branch into another.
- A **pull request** (A) is a formal request to merge a branch, with opportunity for review and discussion.
- A **remote** (D) is a server-hosted copy of the repository (e.g., on GitHub).
- The **staging area** (G) is where you place changes before committing them.
- A **clone** (H) creates a local copy of a remote repository on your machine.

Exercise 33.3: Identifying reproducibility problems

For each code snippet, identify the reproducibility problem:

1.

df = pd.read_csv('C:/Users/alex/Desktop/data/vaccination_data.csv')

2.

train, test = train_test_split(data, test_size=0.2)

3.

import pandas  # no version specified anywhere in the project

4.

# I removed some outliers here
df = df[df['rate'] < 100]

5.

results = model.fit(X, y)
# model accuracy: 0.87 (I wrote this down on a sticky note)
Guidance

1. **Absolute file path.** This path exists only on Alex's computer. Anyone else (or Alex on a different machine) would need to change it. Use relative paths: `df = pd.read_csv('data/vaccination_data.csv')`.
2. **No random seed.** Without `random_state=42`, this will produce a different split every time, leading to different results. Add the `random_state` parameter.
3. **No pinned version.** Pandas 1.5 and pandas 2.0 behave differently. Without a `requirements.txt` with a pinned version (`pandas==2.1.4`), the code may break or produce different results on another machine.
4. **Undocumented decision.** Why 100? Is that a data error threshold? A domain-specific cutoff? Without explanation, no one can evaluate whether this decision is appropriate. Add a comment explaining the rationale.
5. **Results not connected to code.** If the model accuracy is on a sticky note, it cannot be verified. Results should be produced by code, saved to a file, or displayed in a notebook — never recorded manually.
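Taken together, those fixes might look like the following corrected snippet. The inline DataFrame and the 100% threshold are illustrative stand-ins for the real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Relative path (not 'C:/Users/alex/...') works for anyone who clones the repo:
# df = pd.read_csv('data/vaccination_data.csv')

# Small inline DataFrame so this demo is self-contained
data = pd.DataFrame({'rate': range(10), 'label': [0, 1] * 5})

# random_state makes the split identical on every run
train, test = train_test_split(data, test_size=0.2, random_state=42)

# Documented cleaning decision: rates above 100% are data entry errors
train = train[train['rate'] < 100]

# Result produced and labeled by code, not copied onto a sticky note
print(f"Train rows: {len(train)}, test rows: {len(test)}")
```

Pinned versions are handled outside the script, in `requirements.txt`, as described in point 3.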

Exercise 33.4: The .gitignore file

For each file or directory, decide whether it should be tracked by git (YES) or excluded via .gitignore (NO), and explain why:

  1. analysis.ipynb
  2. data/raw/census_2020.csv (500MB)
  3. requirements.txt
  4. .env (contains API keys)
  5. __pycache__/
  6. README.md
  7. venv/ (virtual environment directory)
  8. results/figures/vaccination_trend.png
  9. .ipynb_checkpoints/
  10. src/data_utils.py
Guidance

1. **YES** — notebooks are core project files.
2. **NO** — too large for git. Document the data source in the README instead.
3. **YES** — essential for reproducibility.
4. **NO** — NEVER commit secrets, API keys, or credentials. This is a security risk.
5. **NO** — auto-generated Python bytecode. Not useful and clutters the repository.
6. **YES** — the project's front door.
7. **NO** — environment directories are large and machine-specific. The environment is recreated from `requirements.txt`.
8. **Depends** — small figures that are part of the results: YES. Large files or files easily regenerated by code: NO.
9. **NO** — Jupyter auto-save checkpoints. Clutter, not content.
10. **YES** — source code is a core project file.

Part B: Hands-On Practice ⭐⭐


Exercise 33.5: Initialize a repository ⭐⭐

Create a new directory for a practice project, initialize a git repository, and make your first three commits:

  1. Commit 1: Add a README.md with a project title and one-sentence description
  2. Commit 2: Add a .gitignore file that excludes Python bytecode and Jupyter checkpoints
  3. Commit 3: Add a simple Python script or notebook that prints "Hello, reproducible world!"

After all three commits, run git log --oneline and verify you see three commits with descriptive messages.

Guidance
# Create and enter the project directory
mkdir practice-project
cd practice-project

# Initialize git
git init

# Create and commit the README
echo "# Practice Project" > README.md
echo "A practice repository for learning git." >> README.md
git add README.md
git commit -m "Add README with project description"

# Create and commit the .gitignore
echo "__pycache__/" > .gitignore
echo ".ipynb_checkpoints/" >> .gitignore
echo "*.pyc" >> .gitignore
git add .gitignore
git commit -m "Add .gitignore to exclude Python bytecode and Jupyter checkpoints"

# Create and commit a simple script
echo 'print("Hello, reproducible world!")' > hello.py
git add hello.py
git commit -m "Add initial hello script"

# Verify
git log --oneline
You should see three commits, newest first, each with the message you wrote.

Exercise 33.6: Branching and merging ⭐⭐

Starting from the repository you created in Exercise 33.5:

  1. Create a branch called feature/add-analysis
  2. On that branch, add a new file called analysis.py that creates a small pandas DataFrame and prints its summary statistics
  3. Commit the new file on the branch
  4. Switch back to main and verify that analysis.py does not exist there
  5. Merge the branch into main
  6. Verify that analysis.py now exists on main
  7. Delete the branch (it has been merged, so it is no longer needed)
Guidance
# Create and switch to the branch
git checkout -b feature/add-analysis

# Create the analysis file
cat > analysis.py << 'EOF'
import pandas as pd

data = {'city': ['New York', 'Chicago', 'Houston', 'Phoenix'],
        'population': [8336817, 2693976, 2320268, 1680992],
        'vaccination_rate': [82.3, 78.1, 71.5, 68.9]}

df = pd.DataFrame(data)
print(df.describe())
EOF

# Commit
git add analysis.py
git commit -m "Add analysis script with city vaccination data summary"

# Switch to main and verify analysis.py is not there
git checkout main
ls analysis.py  # Should show "No such file"

# Merge
git merge feature/add-analysis

# Verify analysis.py exists
ls analysis.py  # Should show the file

# Delete the branch
git branch -d feature/add-analysis

Exercise 33.7: Creating a virtual environment ⭐⭐

  1. Create a new virtual environment (using either conda or venv)
  2. Install pandas, numpy, and matplotlib
  3. Generate a requirements.txt file
  4. Deactivate the environment
  5. Delete the environment
  6. Recreate it using only the requirements.txt file
  7. Verify that the same packages are installed
Guidance

**Using conda:**
# Create
conda create --name test-env python=3.11 -y
conda activate test-env

# Install
conda install pandas numpy matplotlib -y

# Save
pip freeze > requirements.txt

# Deactivate and remove
conda deactivate
conda env remove --name test-env

# Recreate
conda create --name test-env python=3.11 -y
conda activate test-env
pip install -r requirements.txt

# Verify
pip list
**Using venv:**
# Create
python -m venv test-env
source test-env/bin/activate  # or test-env\Scripts\activate on Windows

# Install
pip install pandas numpy matplotlib

# Save
pip freeze > requirements.txt

# Deactivate and remove
deactivate
rm -rf test-env

# Recreate
python -m venv test-env
source test-env/bin/activate
pip install -r requirements.txt

# Verify
pip list

Exercise 33.8: Writing a README ⭐⭐

Write a README.md for a hypothetical data science project that analyzes air quality data from a city's sensor network. Include all essential sections:

  1. Project title and one-paragraph description
  2. Key findings (make up 3 plausible findings)
  3. Setup instructions (prerequisites, installation, data download)
  4. Project structure (directory tree)
  5. Usage instructions (how to run the analysis)
  6. License and author information
Guidance

Your README should be clear enough that someone who has never seen the project can set it up and run it by following the instructions. Test this by reading it as if you are a stranger.

Key qualities of a good README:

- **Specific:** "Install Python 3.11 or later" is better than "Install Python"
- **Complete:** Every step is included. Do not skip steps that feel "obvious" to you.
- **Honest:** If there are limitations or known issues, mention them.
- **Scannable:** Use headers, bullet points, and code blocks for easy reading.

Exercise 33.9: Setting random seeds ⭐⭐

Write a Python script that demonstrates the importance of random seeds:

  1. Without a seed, generate two random samples and show they are different
  2. With a seed, generate two random samples and show they are identical
  3. Show that train_test_split produces different results without random_state and identical results with it
import numpy as np
from sklearn.model_selection import train_test_split

# Part 1: Without seed
sample_1 = np.random.randn(5)
sample_2 = np.random.randn(5)
print("Without seed:")
print(f"  Sample 1: {sample_1}")
print(f"  Sample 2: {sample_2}")
print(f"  Are they equal? {np.array_equal(sample_1, sample_2)}")

# Part 2: With seed
np.random.seed(42)
sample_3 = np.random.randn(5)
np.random.seed(42)
sample_4 = np.random.randn(5)
print("\nWith seed:")
print(f"  Sample 3: {sample_3}")
print(f"  Sample 4: {sample_4}")
print(f"  Are they equal? {np.array_equal(sample_3, sample_4)}")

# Part 3: train_test_split
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Without random_state (run twice to see different results)
_, X_test_1, _, _ = train_test_split(X, y, test_size=0.3)
_, X_test_2, _, _ = train_test_split(X, y, test_size=0.3)
print(f"\nSplit without seed - same? {np.array_equal(X_test_1, X_test_2)}")

# With random_state
_, X_test_3, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
_, X_test_4, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Split with seed - same? {np.array_equal(X_test_3, X_test_4)}")

Run this script and verify the output matches your expectations.

Guidance

Expected output:

- Part 1: The samples will be different (almost certainly)
- Part 2: The samples will be identical
- Part 3: Without `random_state`, the splits will likely differ. With `random_state=42`, they will be identical.

This is a powerful demonstration for understanding why random seeds matter. Save this script — you can show it to teammates who are skeptical about seeding.

Exercise 33.10: Viewing git diffs ⭐⭐

In your practice repository:

  1. Modify an existing file (change a number, fix a typo, add a line)
  2. Run git diff to see the unstaged changes
  3. Stage the file with git add
  4. Run git diff --staged to see the staged changes
  5. Commit with a descriptive message
  6. Run git log -1 --stat to see what was changed in the last commit

Describe what information each command provides and when you would use it.

Guidance

- `git diff` shows changes in your working directory that have NOT been staged. Use this to review your work before staging. "What have I changed?"
- `git diff --staged` shows changes that HAVE been staged and will be included in the next commit. Use this to verify what you are about to commit. "What am I about to save?"
- `git log -1 --stat` shows the last commit with a summary of files changed and lines added/removed. Use this to verify what was just committed. "What did I just save?"

These three commands form a review workflow: diff → stage → diff --staged → commit → log.

Exercise 33.11: Undoing mistakes in git ⭐⭐

Practice these common "undo" operations in your practice repository:

  1. Unstage a file: Stage a file, then remove it from the staging area without losing the changes
  2. Discard changes: Modify a file, then discard the modifications (return to the last committed version)
  3. Amend a commit message: Make a commit with a typo in the message, then fix the message

For each, write the git commands you used.

Guidance

1. **Unstage a file:**
# Stage a file
git add some_file.py
# Unstage it (changes remain in working directory)
git restore --staged some_file.py
2. **Discard changes:**
# Modify a file, then discard modifications
git restore some_file.py
# WARNING: This permanently discards your changes!
3. **Amend a commit message:**
# Make a commit with a typo
git commit -m "Aded analysis script"
# Fix the message (only use for unpushed commits!)
git commit --amend -m "Added analysis script"
**Important:** `git commit --amend` rewrites history. Only use it for commits that have NOT been pushed to a shared remote. Amending a pushed commit can cause problems for collaborators.

Part C: Team Simulation ⭐⭐⭐


Exercise 33.12: Simulating a merge conflict ⭐⭐⭐

Create a merge conflict on purpose and resolve it:

  1. Create a file with a few lines of text and commit it to main
  2. Create a branch called feature-a and modify line 3
  3. Switch to main and modify the same line 3 differently
  4. Try to merge feature-a into main
  5. Resolve the conflict by choosing the correct version
  6. Complete the merge with a commit
Guidance
# Setup: create and commit a file
cat > data_config.py << 'EOF'
# Data configuration
DATA_DIR = "data/"
OUTPUT_FORMAT = "csv"
RANDOM_SEED = 42
EOF
git add data_config.py
git commit -m "Add data configuration file"

# Branch A: change OUTPUT_FORMAT
git checkout -b feature-a
# (Edit data_config.py to change OUTPUT_FORMAT = "parquet")
git add data_config.py
git commit -m "Change output format to parquet for better performance"

# Back to main: change the same line differently
git checkout main
# (Edit data_config.py to change OUTPUT_FORMAT = "json")
git add data_config.py
git commit -m "Change output format to json for web compatibility"

# Try to merge — this will conflict
git merge feature-a
Git will report a conflict in `data_config.py`. Open the file and you will see conflict markers:
<<<<<<< HEAD
OUTPUT_FORMAT = "json"
=======
OUTPUT_FORMAT = "parquet"
>>>>>>> feature-a
Choose the correct version (or combine: perhaps `OUTPUT_FORMAT = "parquet"` with a comment about the reasoning), remove the conflict markers, stage, and commit:
git add data_config.py
git commit -m "Resolve conflict: use parquet format for performance (see PR discussion)"

Exercise 33.13: Pull request simulation ⭐⭐⭐

If you have a GitHub account, practice the pull request workflow:

  1. Create a new repository on GitHub
  2. Clone it locally
  3. Create a branch, make changes, and push the branch
  4. Open a pull request on GitHub
  5. Add a description explaining what changed and why
  6. (If working with a partner) Have them review and approve
  7. Merge the pull request on GitHub
  8. Pull the merged changes locally

If you do not have a GitHub account, write out the steps you would take and explain what each step accomplishes.

Guidance
# Clone
git clone https://github.com/yourusername/practice-repo.git
cd practice-repo

# Create branch
git checkout -b feature/add-data-cleaning

# Make changes (add a file, edit code, etc.)
echo "# Data Cleaning Script" > clean_data.py
git add clean_data.py
git commit -m "Add data cleaning script skeleton"

# Push branch to remote
git push -u origin feature/add-data-cleaning
Then on GitHub:

- Navigate to the repository
- Click "Compare & pull request" (GitHub usually shows this automatically)
- Write a title and description
- Request review from a teammate
- After approval, click "Merge pull request"

Back locally:
git checkout main
git pull origin main

Exercise 33.14: Code review practice ⭐⭐⭐

Review the following code as if it were submitted in a pull request. For each issue you find, explain the problem and suggest a fix. Look for reproducibility problems, unclear code, and missing documentation.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv('/home/sarah/Desktop/project/data.csv')

# clean
df = df.dropna()
df = df[df.age > 0]
df = df[df.age < 120]

X = df[['age', 'income', 'zipcode']]
y = df['purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Guidance

Issues and suggested fixes:

1. **Absolute file path:** `/home/sarah/Desktop/project/data.csv` will not work on anyone else's machine. Fix: use a relative path like `data/data.csv`.
2. **No random seed:** `train_test_split` and `RandomForestClassifier` both use randomness. Fix: add `random_state=42` to both.
3. **Undocumented data cleaning:** Why drop all NAs? Why the age filters? Fix: add comments explaining the rationale (e.g., "Ages below 0 are data entry errors; ages above 120 are implausible").
4. **Using zipcode as a feature:** As discussed in Chapter 32, zipcode can be a proxy for race, leading to discriminatory predictions. Fix: at minimum, document this decision and evaluate for disparate impact.
5. **Import not at top:** `RandomForestClassifier` is imported midway through the script. Fix: move all imports to the top.
6. **No output context:** `print(model.score(...))` prints a number with no label. Fix: `print(f"Test accuracy: {model.score(X_test, y_test):.3f}")`.
7. **No data exploration:** The script jumps from loading to modeling with no exploration. Fix: add `df.shape`, `df.dtypes`, `df.describe()` to understand the data first.
8. **No requirements or version info:** No indication of which library versions are needed. Fix: create a `requirements.txt`.
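A revised script addressing most of these issues might look like the sketch below. The relative path, column names, and age thresholds mirror the original and remain illustrative; the zipcode caveat is kept as a documented decision rather than silently resolved:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # one seed, defined once, used for every source of randomness


def run_analysis(csv_path='data/data.csv'):
    """Load, clean, and model the purchase data; return test accuracy."""
    # Relative path by default, so anyone who clones the repo can run it
    df = pd.read_csv(csv_path)

    # A quick look at the data before modeling
    print(f"Rows, columns: {df.shape}")

    # Cleaning, with each decision documented:
    # ages <= 0 are data entry errors; ages >= 120 are implausible
    df = df.dropna()
    df = df[(df.age > 0) & (df.age < 120)]

    # NOTE: zipcode can act as a proxy for protected attributes;
    # kept here to match the original, but audit for disparate impact
    X = df[['age', 'income', 'zipcode']]
    y = df['purchased']

    # random_state on both the split and the model for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=SEED)
    model = RandomForestClassifier(n_estimators=100, random_state=SEED)
    model.fit(X_train, y_train)

    accuracy = model.score(X_test, y_test)
    print(f"Test accuracy: {accuracy:.3f}")  # labeled and code-generated
    return accuracy


# Example: accuracy = run_analysis('data/data.csv')
```

Wrapping the script in a function is a choice, not a requirement of the review: it lets the path be passed in, which makes the analysis testable on any machine.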

Exercise 33.15: Writing a CONTRIBUTING.md ⭐⭐⭐

Write a CONTRIBUTING.md file for a team data science project. Include:

  1. How to set up the development environment
  2. The branching strategy (what to name branches, when to create them)
  3. Commit message format requirements
  4. The pull request process (who reviews, how long to wait)
  5. Code style conventions (naming, comments, notebook formatting)
  6. How to report bugs or request features
Guidance

A good CONTRIBUTING.md reduces friction in team collaboration. Key elements:
# Contributing to [Project Name]

## Environment Setup
1. Clone the repository
2. Create a conda environment: `conda env create -f environment.yml`
3. Activate: `conda activate project-name`

## Branching Strategy
- Create a branch for each feature or fix
- Name branches: `feature/description`, `fix/description`, `docs/description`
- Never commit directly to `main`

## Commit Messages
- Start with a verb: Add, Fix, Update, Remove, Refactor
- Keep the first line under 72 characters
- Reference issue numbers when applicable: "Fix #42: correct rate calculation"

## Pull Requests
- Include a description of what changed and why
- Request review from at least one team member
- Address all review comments before merging
- Reviewers should respond within 2 business days

## Code Style
- Python: follow PEP 8
- Notebooks: use Markdown cells to narrate; clean up before committing
- Variable names: descriptive (`vaccination_rate`, not `vr`)
- Set random seeds: `np.random.seed(42)` at the top of every notebook

## Issues
- Use GitHub Issues to report bugs or request features
- Include: expected behavior, actual behavior, steps to reproduce

Part D: Synthesis and Extension ⭐⭐⭐–⭐⭐⭐⭐


Exercise 33.16: Full project setup ⭐⭐⭐

Set up a complete, reproducible project from scratch:

  1. Create a project directory with the standard structure (notebooks, data, src, results)
  2. Initialize a git repository
  3. Create a virtual environment and install necessary packages
  4. Generate a requirements.txt
  5. Write a .gitignore
  6. Write a README
  7. Create a simple analysis notebook with random seeds set
  8. Make three well-messaged commits (initial setup, add analysis, add documentation)
  9. Verify the entire project can be set up from scratch by a new user

This exercise brings together everything in the chapter.

Guidance

The verification step (9) is the most important. To verify:

- Delete the virtual environment
- Recreate it using only `requirements.txt`
- Run the notebook
- Do you get the same results?

If yes, your project is reproducible. If no, identify what is missing and fix it.
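One way to make "do you get the same results?" concrete is to checksum your output files after each fresh run; identical digests mean byte-identical outputs. The `results/summary.csv` path is a hypothetical example:

```python
import hashlib
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# After each from-scratch environment rebuild and full run:
#   print(file_checksum('results/summary.csv'))
# Matching digests across independent runs mean byte-identical outputs.
```

Note that this checks byte-for-byte equality: files that embed timestamps or compression metadata (some image formats) can differ even when the underlying results match, so hash data outputs like CSVs rather than rendered figures.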

Exercise 33.17: Reproducibility forensics ⭐⭐⭐

You receive a Jupyter notebook from a colleague with these issues:

- No requirements.txt or environment.yml
- Cells run out of order (cell 5 depends on cell 12)
- Absolute file paths for data
- No random seeds
- Results are described in text ("accuracy was about 87%") but not produced by visible code
- Several cells produce warnings that are not addressed

Write a plan to make this notebook reproducible. List every specific action you would take, in order.

Guidance

**Step-by-step remediation plan:**

1. **Determine the notebook's purpose:** Read through the entire notebook to understand what it is trying to accomplish.
2. **Create the environment:** Install the libraries the notebook seems to use (infer from import statements). Run cells to identify any additional missing dependencies. Pin versions in a requirements.txt.
3. **Reorganize cells:** Reorder cells so the notebook runs top-to-bottom without jumping. Use "Restart and Run All" to verify.
4. **Fix file paths:** Replace all absolute paths with relative paths. Document where the data comes from in a README.
5. **Add random seeds:** Add `np.random.seed(42)` at the top. Add `random_state=42` to all scikit-learn functions.
6. **Make results code-generated:** Replace text-described results with actual code outputs. If "accuracy was about 87%," add a cell that computes and prints the accuracy.
7. **Address warnings:** Investigate each warning. Either fix the underlying issue or explicitly suppress the warning with a documented reason.
8. **Add narration:** Add Markdown cells explaining each section.
9. **Test:** Restart kernel, Run All, verify all cells execute without error and produce expected output.
10. **Commit:** Initialize git, create .gitignore, commit the cleaned notebook with a clear message.
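For the seed-setting step, a first cell placed at the top of the notebook might look like this; the value 42 is conventional and arbitrary:

```python
import random

import numpy as np

SEED = 42  # the specific value is arbitrary; fixing it is what matters
random.seed(SEED)     # Python's built-in random module
np.random.seed(SEED)  # NumPy's global generator

# Also pass the seed explicitly to any function that accepts one, e.g.:
# train_test_split(X, y, test_size=0.2, random_state=SEED)
```

Seeding the global generators covers library code that draws from them, while passing `random_state=SEED` explicitly documents exactly where randomness enters the analysis.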

Exercise 33.18: Git history detective ⭐⭐⭐

Using a public GitHub repository (any popular open-source project), explore the git history:

  1. How many commits does the repository have?
  2. Find a commit message that is particularly informative. What makes it good?
  3. Find a commit message that is uninformative. What would improve it?
  4. Find a pull request with code review comments. What did the reviewer catch?
  5. Look at the .gitignore file. What types of files are excluded?

Write brief answers for each. This exercise builds your ability to read and navigate repository history.

Guidance

Good repositories to explore:

- `pandas-dev/pandas` — the pandas library itself
- `scikit-learn/scikit-learn` — scikit-learn
- `mwaskom/seaborn` — seaborn

On GitHub, you can browse commit history, pull requests, and .gitignore files through the web interface. Look at the "Pull requests" tab (both open and closed) to see code review in action. This exercise is valuable because reading other people's git history teaches you what good practices look like in real projects.

Exercise 33.19: The reproducibility report card ⭐⭐⭐⭐

Go back to an analysis you completed earlier in this course (any chapter's exercise or project milestone). Grade it on reproducibility using this rubric:

| Criterion | Score (0-3) |
|-----------|-------------|
| Version controlled (in git) | |
| Dependencies documented (requirements.txt) | |
| Random seeds set | |
| File paths are relative | |
| README exists and is complete | |
| Analysis runs top-to-bottom without manual intervention | |
| Results are code-generated, not manually recorded | |
| Data source is documented | |

Score each criterion: 0 = not done, 1 = partially done, 2 = mostly done, 3 = fully done.

Then: pick the two lowest-scoring criteria and fix them. Document what you changed.

Guidance

Most students will find that their earlier work scores poorly on several criteria — and that is the point. Reproducibility is a practice you build over time. By identifying your weakest areas and fixing them, you establish habits that will carry forward.

Common gaps: no requirements.txt (easy to fix now), absolute file paths (easy to fix), no random seeds (easy to fix), no README (more effort but highly valuable).

Exercise 33.20: Environment debugging ⭐⭐⭐

Your colleague sends you a project with a requirements.txt. When you run pip install -r requirements.txt, you get errors because:

  1. One package has been removed from PyPI
  2. Two packages have version conflicts (Package A requires numpy<1.24, Package B requires numpy>=1.25)
  3. One package requires a C compiler that is not installed on your system

For each problem, describe what the error message might look like and propose a solution.

Guidance

1. **Package removed:** Error: `ERROR: Could not find a version that satisfies the requirement some_old_package==0.3.2`. Solution: check if the package was renamed (common) or if an alternative package provides the same functionality. If the dependency is not actually used in the code, remove it from requirements.txt.
2. **Version conflict:** Error: `ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. Some combination of conflicting dependencies...` Solution: check whether either package has a newer version that resolves the conflict. If not, you may need to choose compatible versions manually. Use `pip install packageA packageB` (without version pins) to let pip resolve the conflict, then re-freeze.
3. **Missing compiler:** Error: `error: Microsoft Visual C++ 14.0 is required` or `gcc: command not found`. Solution: install the required build tools (Visual Studio Build Tools on Windows, `xcode-select --install` on macOS, `build-essential` on Linux). Alternatively, look for a pre-compiled wheel of the package (`pip install --prefer-binary package_name`).

The meta-lesson: environment reproduction is not always smooth, but the problems are solvable, and having a requirements.txt makes debugging much easier than having no record at all.
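Before reaching for these fixes, it can help to see exactly which pins are broken. A minimal sketch that compares simple `name==x.y.z` pins against the active environment (it ignores extras, ranges, and environment markers):

```python
from importlib.metadata import PackageNotFoundError, version


def check_requirements(path="requirements.txt"):
    """Report pinned requirements that are missing or mismatched.

    Returns a list of human-readable problem descriptions; an empty
    list means every pinned package is installed at the pinned version.
    """
    problems = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue  # skip comments, blanks, and non-pinned specs
            name, pinned = line.split("==", 1)
            try:
                installed = version(name)
            except PackageNotFoundError:
                problems.append(f"{name}: not installed (pinned {pinned})")
                continue
            if installed != pinned:
                problems.append(
                    f"{name}: installed {installed}, pinned {pinned}")
    return problems
```

Running this before `pip install` will not catch every failure mode (it cannot detect a package removed from PyPI or a missing compiler), but it turns "something is wrong with my environment" into a concrete list of mismatches.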

Exercise 33.21: Reproducibility across programming languages ⭐⭐⭐⭐

Research how reproducibility and environment management work in ONE of the following ecosystems (choose one that is NOT Python):

  1. R (renv, packrat, Docker)
  2. Julia (Pkg, Project.toml)
  3. JavaScript/Node.js (package.json, npm)

Write a 200-word comparison with Python's approach. What is similar? What is different? What could Python learn from the other ecosystem?

Guidance

**R's `renv`** is similar to Python's virtual environments but is more tightly integrated with the language. `renv::snapshot()` creates a lockfile (like `requirements.txt` but with more metadata). R's `sessionInfo()` function captures the complete runtime environment including OS and R version — something Python lacks as a built-in. Python could learn from R's emphasis on capturing the full session context.

**Julia's `Pkg`** uses `Project.toml` and `Manifest.toml` — a two-file system where the Project file specifies direct dependencies and the Manifest file locks exact versions of ALL dependencies (including transitive ones). This is more robust than Python's `requirements.txt` and similar to `pip freeze` combined with `pip-tools`. Julia's approach is arguably the most elegant among modern languages.

**Node.js's `package.json`** and `package-lock.json` parallel Julia's approach. The `node_modules` directory (equivalent to a virtual environment) is per-project by default, and `npm install` recreates it deterministically from the lockfile. Node's ecosystem solved the "dependency hell" problem earlier than Python and offers lessons in lockfile design.

Exercise 33.22: Docker for reproducibility ⭐⭐⭐⭐

Research Docker containers and write a 150-word explanation of how they solve reproducibility problems that virtual environments cannot. Specifically address:

  1. What is a Docker container?
  2. How does it differ from a virtual environment?
  3. What reproducibility problems does it solve that requirements.txt alone cannot?
Guidance

A Docker container is a lightweight, isolated environment that packages not just Python libraries, but the entire operating system, system libraries, and runtime environment. While a virtual environment isolates Python packages, a Docker container isolates *everything* — the OS, the file system, the system libraries, and the Python environment.

This solves problems that requirements.txt cannot:

- **System-level dependencies:** Some Python packages require system libraries (C compilers, database drivers) that vary by OS. Docker includes these.
- **OS differences:** Code that runs on Ubuntu may not run on macOS. Docker containers run the same OS everywhere.
- **Complete reproducibility:** A Docker image captures the exact state of the entire computing environment, not just the Python packages.

The tradeoff: Docker adds complexity (you need to learn Dockerfile syntax and container management). For most data science projects, requirements.txt is sufficient; Docker is valuable for production deployment and when system-level dependencies are involved.

Exercise 33.23: Data versioning ⭐⭐⭐⭐

Git tracks code changes well, but data changes are harder. Research ONE tool for data versioning:

  1. DVC (Data Version Control)
  2. Git LFS (Large File Storage)
  3. Delta Lake

Write a 150-word summary of how the tool works and when you would use it.

Guidance

**DVC** works alongside git: you use git to track code and DVC to track data. DVC stores metadata (file hashes) in git, while the actual data files are stored in a remote storage backend (S3, Google Drive, etc.). When you check out a git commit, DVC retrieves the corresponding data version. Use DVC when your project has large datasets that change over time and you need to reproduce analyses with specific data versions.

**Git LFS** extends git to handle large files. Instead of storing the full file content, git LFS stores a pointer in the git repository and the actual file on a separate server. Use Git LFS when you have a few large files (models, images) that need to be version-controlled but are too large for regular git.

**Delta Lake** is a storage layer that brings versioning to data lakes. It stores a transaction log that records every change to the data, enabling time travel (querying past versions). Use Delta Lake for production data pipelines where data is continuously updated.

Exercise 33.24: Team workflow design ⭐⭐⭐⭐

You are leading a data science team of four people working on a six-month project. Design a complete workflow covering:

  1. Repository structure
  2. Branching strategy
  3. Code review process
  4. Environment management
  5. Documentation standards
  6. How to handle large datasets
  7. Meeting cadence and communication tools

Write this as a one-page team agreement (about 400 words).

Guidance

A strong team agreement is specific and practical. Rather than "we will use git," specify "we will use GitHub with branch protection on main requiring one approval." Rather than "we will document," specify "every notebook must start with a Markdown cell stating the question being addressed, the data source, and the date." The agreement should be short enough to read in five minutes and specific enough that a new team member could follow it without further explanation.

Consider including: naming conventions (for branches, files, and variables), how often to merge branches (daily? weekly?), where to store data (shared drive? S3 bucket?), and what happens when someone breaks main (it happens — have a plan).

Exercise 33.25: The reproducibility pledge

Write a three-sentence personal commitment to reproducibility. What specific practices will you adopt from this chapter, starting with your next project?

This is a reflection exercise — there is no wrong answer, but be specific.

Guidance

A strong pledge is concrete:

**Weak:** "I will try to make my work more reproducible."

**Strong:** "Starting with my next project, I will (1) initialize a git repository before writing any code and commit at least once per work session, (2) create a requirements.txt with pinned versions before sharing any code, and (3) set `np.random.seed(42)` at the top of every notebook."

Three specific, actionable commitments are worth more than a vague intention.

Reflection

The tools in this chapter — git, virtual environments, documentation — are not exciting. They do not produce cool visualizations or impressive models. But they are the infrastructure that makes everything else possible. Without version control, you lose work. Without environment management, your code breaks. Without documentation, your analysis is a mystery even to you.

The best time to adopt these practices was at the beginning of the course. The second best time is now. Every project from this point forward should start with `git init`, a virtual environment, and a README.