Chapter 33 Quiz: Reproducibility and Collaboration: Git, Environments, and Working with Teams
Instructions: This quiz tests your understanding of Chapter 33. Answer all questions before checking the solutions. For multiple choice, select the best answer. For short answer questions, aim for 2-4 clear sentences. Total points: 100.
Section 1: Multiple Choice (10 questions, 4 points each)
Question 1. What is the primary purpose of version control?
- (A) To make code run faster
- (B) To track changes to files over time, allowing you to recall specific versions and collaborate without conflicts
- (C) To compress files for storage
- (D) To automatically fix bugs in code
Answer
**Correct: (B)** Version control tracks every change to every file in a project, records who made each change and when, and allows you to revert to any previous state. It enables collaboration by allowing multiple people to work on the same project without overwriting each other's work. It does not affect code performance (A), is not a compression tool (C), and does not fix bugs (D).

Question 2. In git, what is the correct order of operations to save changes?
- (A) commit → add → push
- (B) add → commit → push
- (C) push → add → commit
- (D) commit → push → add
Answer
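Worked example: a minimal sketch of the full cycle in a throwaway repository (the directory and file names are illustrative, and git must be installed):

```shell
# Illustrative add → commit → push cycle in a throwaway repository.
git init demo-repo
git -C demo-repo config user.email "you@example.com"  # commit identity (local to this repo)
git -C demo-repo config user.name "Demo User"
echo "results = 42" > demo-repo/analysis.py           # edit a file in the working directory
git -C demo-repo add analysis.py                      # 1. stage the change
git -C demo-repo commit -m "Add analysis script"      # 2. snapshot it in the local repository
# 3. 'git push origin main' would upload the commit, once a remote is configured
```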
**Correct: (B)** The correct workflow is: (1) `git add` to stage changes (move them to the staging area), (2) `git commit` to save the staged changes as a snapshot in the local repository, (3) `git push` to upload the committed changes to a remote repository. Staging before committing lets you choose which changes to include in each commit.

Question 3. Which of the following is the BEST commit message?
- (A) "update"
- (B) "Fix vaccination rate calculation: was dividing by total population instead of eligible population"
- (C) "changed stuff"
- (D) "commit"
Answer
**Correct: (B)** A good commit message starts with a verb, explains *what* changed, and explains *why*. (B) tells you exactly what was fixed and what the bug was. (A), (C), and (D) provide no useful information — looking at these in a git log six months later, you would have no idea what each commit changed.

Question 4. What is the purpose of a virtual environment?
- (A) To make Python run in a virtual machine
- (B) To isolate project dependencies so that each project has its own set of library versions
- (C) To speed up Python execution
- (D) To create a backup of your project
Answer
**Correct: (B)** A virtual environment creates an isolated Python installation with its own packages. This prevents conflicts between projects (Project A needs pandas 1.5, Project B needs pandas 2.0) and ensures that library updates for one project do not break another. It is not a virtual machine (A), does not affect execution speed (C), and is not a backup mechanism (D).

Question 5. What does `pip freeze > requirements.txt` do?
- (A) Freezes the computer to prevent changes
- (B) Writes the exact versions of all installed packages to a file, enabling others to recreate the environment
- (C) Installs packages from requirements.txt
- (D) Deletes unused packages
Answer
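For reference, a minimal sketch of the round trip (run inside the project's virtual environment so that only that project's packages are captured):

```shell
# Record the exact version of every package installed in the active environment.
# (If `pip` is not on your PATH, `python3 -m pip freeze` is equivalent.)
pip freeze > requirements.txt

# Later, on another machine, recreate the environment inside a fresh venv:
# pip install -r requirements.txt
```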
**Correct: (B)** `pip freeze` outputs a list of all installed packages with their exact versions. The `>` redirects this output to a file called `requirements.txt`. This file can then be used by others (or your future self) to recreate the exact same environment using `pip install -r requirements.txt`. (C) describes `pip install -r requirements.txt`, not `pip freeze`.

Question 6. What is a branch in git?
- (A) A copy of the repository on a different computer
- (B) An independent line of development where you can make changes without affecting the main code
- (C) A file that lists all the changes in the repository
- (D) A tool for merging two repositories together
Answer
**Correct: (B)** A branch is a parallel line of development. When you create a branch, you can modify files without affecting the main branch. This enables experimental work, feature development, and team collaboration. Branches can be merged back into the main branch when the work is complete, or deleted if the experiment did not work out.

Question 7. Which file should NEVER be committed to a git repository?
- (A) README.md
- (B) requirements.txt
- (C) .env (containing API keys and passwords)
- (D) analysis.ipynb
Answer
**Correct: (C)** Files containing secrets — API keys, passwords, database credentials, personal tokens — should NEVER be committed to a git repository. Once committed, they are part of the permanent history and can be found even if the file is later deleted. Use `.gitignore` to prevent these files from being tracked. All other options (A, B, D) are standard files that should be tracked.

Question 8. What is a pull request?
- (A) A request to download a repository from GitHub
- (B) A request to merge one branch into another, with an opportunity for code review and discussion
- (C) A request to pull data from a database
- (D) A request to delete a branch
Answer
**Correct: (B)** A pull request (PR) is a formal proposal to merge changes from one branch into another (usually into the main branch). It provides a space for code review, discussion, and approval before changes are integrated. Pull requests are the heart of collaborative development — they ensure that code is reviewed by at least one other person before it enters the shared codebase.

Question 9. Why is setting a random seed important for reproducibility?
- (A) It makes the code run faster
- (B) It ensures that random operations (like train/test splits) produce the same results every time
- (C) It prevents bugs in random number generation
- (D) It is required by Python syntax
Answer
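The effect is easy to demonstrate. The sketch below uses Python's standard-library `random` module; `np.random.seed(42)` and scikit-learn's `random_state=42` rest on the same principle:

```python
import random

def random_split(n, train_frac=0.8, seed=42):
    """Deterministically shuffle indices 0..n-1 and split them into train/test."""
    rng = random.Random(seed)      # seeded generator: same sequence every run
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

# Two separate runs with the same seed produce identical "random" splits:
train_a, test_a = random_split(100)
train_b, test_b = random_split(100)
assert train_a == train_b and test_a == test_b
```

Remove the seed and the assertion fails on almost every run, which is exactly the reproducibility failure described below.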
**Correct: (B)** Many data science operations involve randomness — train/test splits, model initialization, bootstrapping. Without a fixed seed, these operations produce different results each run, making it impossible to reproduce results exactly. Setting `np.random.seed(42)` or using `random_state=42` ensures the same "random" sequence is generated every time, making the analysis reproducible.

Question 10. The "reproducibility crisis" refers to:
- (A) The inability of computers to generate truly random numbers
- (B) The widespread failure of published research to produce the same results when repeated
- (C) The difficulty of learning git
- (D) The shortage of data scientists in the workforce
Answer
**Correct: (B)** The reproducibility crisis is a well-documented phenomenon in which a significant proportion of published scientific results cannot be reproduced by independent researchers. In some fields, reproduction rates are as low as 11%. Causes include undocumented software environments, missing random seeds, unpublished data, and insufficient methodological detail. This chapter's tools — git, virtual environments, documentation — directly address these issues.

Section 2: True or False (4 questions, 4 points each)
Question 11. True or False: `git add .` stages all changed and new files in the current directory and subdirectories.
Answer
**True.** `git add .` stages all modifications, deletions, and new (untracked) files in the current directory and all subdirectories. This is convenient but can accidentally stage files you did not intend to commit (like data files or configuration secrets). It is safer to stage specific files by name or to have a comprehensive `.gitignore` file.

Question 12. True or False: Once a file is listed in `.gitignore`, it can never be tracked by git.
Answer
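A sketch of the problem and the fix in a throwaway repository (file names are illustrative):

```shell
# A file that is already tracked keeps being tracked even after it is added
# to .gitignore; 'git rm --cached' removes it from tracking but keeps it on disk.
git init ignore-demo
git -C ignore-demo config user.email "you@example.com"
git -C ignore-demo config user.name "Demo User"
echo "API_KEY=abc123" > ignore-demo/secrets.env
git -C ignore-demo add secrets.env
git -C ignore-demo commit -m "Accidentally track a secrets file"
echo "secrets.env" > ignore-demo/.gitignore      # too late: the file is already tracked
git -C ignore-demo rm --cached secrets.env       # stop tracking it (file stays on disk)
git -C ignore-demo add .gitignore
git -C ignore-demo commit -m "Stop tracking secrets.env"
# Note: the secret still exists in the earlier commit's history,
# so any key that was ever committed should be treated as compromised.
```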
**False.** `.gitignore` only prevents *untracked* files from being staged. If a file was already tracked by git before it was added to `.gitignore`, git will continue tracking it. To stop tracking a file that is already committed, you need to explicitly remove it from tracking with `git rm --cached filename`.

Question 13. True or False: A merge conflict occurs when two branches modify different files.
Answer
**False.** A merge conflict occurs when two branches modify the *same lines* of the *same file*. If two branches modify different files, or even different parts of the same file, git can merge them automatically without a conflict. Conflicts only arise when git cannot determine which version of a specific line to keep.

Question 14. True or False: A README file should contain enough information for someone to set up and run your project without any additional help from you.
Answer
**True.** A good README is your project's front door. It should explain what the project does, how to install dependencies, how to obtain the data, and how to run the analysis. The test is: could a stranger follow the README and successfully run your project? If not, the README needs more detail.

Section 3: Short Answer (4 questions, 6 points each)
Question 15. Explain the difference between the working directory, the staging area, and the repository in git.
Answer
**Working directory:** The files on your disk as you see them. This is where you edit, create, and delete files during your normal work. Changes here are not yet recorded by git.

**Staging area:** A holding zone (also called the "index") where you place changes that you want to include in your next commit. You move changes from the working directory to the staging area using `git add`. This lets you selectively choose which changes to commit — you might have modified five files but only want to commit changes to two of them.

**Repository:** The permanent history of committed snapshots. When you run `git commit`, the staged changes are saved as a new commit in the repository's history. Commits are permanent (barring destructive operations) and can be revisited at any time.

The flow is: edit (working directory) → stage (staging area) → commit (repository).

Question 16. Name three things that should be included in a `.gitignore` file for a Python data science project, and explain why each should be excluded.
Answer
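One way to express an answer is as a starter `.gitignore` for a Python data science project (entries are illustrative, not exhaustive; the rationale for each group follows):

```
# Python bytecode
__pycache__/
*.pyc

# Virtual environments
venv/
.conda/

# Secrets
.env
credentials.json

# Jupyter checkpoints and OS/IDE clutter
.ipynb_checkpoints/
.DS_Store
.vscode/
```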
1. **`__pycache__/` and `*.pyc`:** These are Python bytecode files, automatically generated when Python runs. They are machine-specific, serve no purpose in a repository, and would clutter the commit history with meaningless changes.
2. **`venv/` or `.conda/`:** Virtual environment directories contain hundreds of installed library files. They are machine-specific (compiled for your OS), extremely large, and should be recreated from `requirements.txt` rather than stored in git.
3. **`.env` or `credentials.json`:** Files containing API keys, passwords, or other secrets should NEVER be committed to a repository. If pushed to a public or shared remote, these secrets are exposed and can be exploited. Even in private repositories, secrets should be managed through environment variables or secret management tools, not committed to version control.

Other valid answers: `.ipynb_checkpoints/` (Jupyter auto-saves), large data files (too big for git), `.DS_Store` (macOS system files), IDE configuration (`.vscode/`, `.idea/`).

Question 17. Explain the feature branch workflow used in team collaboration. Why is it better than everyone working directly on the main branch?
Answer
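The core of the cycle can be sketched locally in a throwaway repository (on a real team, the push and pull-request steps happen on a hosted remote such as GitHub; `git init -b main` assumes git 2.28 or newer):

```shell
# Set up a repository with an initial commit on main.
git init -b main team-demo
git -C team-demo config user.email "you@example.com"
git -C team-demo config user.name "Demo User"
echo "# Team project" > team-demo/README.md
git -C team-demo add README.md
git -C team-demo commit -m "Initial commit on main"

# Do feature work on its own branch.
git -C team-demo checkout -b feature/add-rural-analysis  # branch off main
echo "rural = True" > team-demo/rural_analysis.py
git -C team-demo add rural_analysis.py
git -C team-demo commit -m "Add rural analysis"          # commit on the feature branch
# ...push the branch and open a pull request for review, then after approval:
git -C team-demo checkout main
git -C team-demo merge feature/add-rural-analysis        # integrate into main
```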
In the **feature branch workflow**:

1. Each team member creates a separate branch for their work (e.g., `feature/add-rural-analysis`)
2. They make changes and commit on their branch
3. When finished, they push the branch to the remote and open a pull request
4. Another team member reviews the code
5. After approval, the branch is merged into main

This is better than working directly on main because:

- **Isolation:** One person's experimental or in-progress work cannot break the shared codebase. If something goes wrong on a branch, main is unaffected.
- **Review:** Pull requests enable code review — catching bugs, improving quality, and sharing knowledge.
- **Parallel work:** Multiple people can work on different features simultaneously without interfering with each other.
- **History:** Each feature gets its own clear commit history, making it easy to understand what was added and why.

Question 18. A colleague sends you a project with no `requirements.txt`, no README, and no git history. They say "just run `analysis.ipynb`." Describe three specific problems you might encounter and how each could have been prevented.
Answer
1. **Missing dependencies:** The notebook imports libraries that are not installed on your machine, or relies on specific versions that differ from yours. The code might fail with import errors, or worse, run but produce different results due to version differences. **Prevention:** A `requirements.txt` with pinned versions would allow you to recreate the exact environment.
2. **Unknown data source:** The notebook loads data from a file path that does not exist on your machine, and you do not know where to get the data. **Prevention:** A README should document the data source, provide download instructions, and specify where to place the files.
3. **Non-reproducible results:** The notebook uses random operations (train/test splits, model training) without setting seeds, so you get different results than the colleague reported. Without git history, you cannot see what version of the code produced the original results. **Prevention:** Random seeds at the top of the notebook, and git commits that record the state of the code when results were produced.

Section 4: Applied Scenarios (2 questions, 8 points each)
Question 19. You are working on a data science team of three people. Person A is building a data cleaning pipeline, Person B is developing visualizations, and Person C (you) is building a predictive model. All three components depend on the same dataset.
Describe the git workflow you would use for this scenario. Address: (a) how each person's work stays isolated, (b) how changes are integrated, (c) what happens if Person A changes the data cleaning logic (which affects the data format that B and C depend on), and (d) how you ensure the integrated project works.
Answer
**(a) Isolation:** Each person works on their own branch: `feature/data-cleaning` (Person A), `feature/visualization` (Person B), `feature/prediction-model` (Person C). All branches are created from the current state of `main`. Each person commits their work to their own branch without affecting the others.

**(b) Integration:** When a person's work is ready, they push their branch to the remote and open a pull request. At least one other team member reviews the code. After approval, the branch is merged into `main`. Other team members then pull the updated `main` and merge it into their own branches to stay current.

**(c) Data format change:** Person A's change is high-impact — it affects downstream work. The team should: (1) Person A opens a PR and explicitly flags the data format change in the PR description. (2) Persons B and C review the PR to understand the new format. (3) After the PR is merged, B and C merge the updated `main` into their branches and update their code to match the new format. (4) If the format change is significant, Person A should notify the team in advance (via Slack, a meeting, or an issue) so B and C can prepare.

**(d) Ensuring integration works:** After all three branches are merged into `main`, run the entire pipeline end-to-end: cleaning → analysis → visualization → model. Use a CI/CD tool or a manual "run all" test. If anything breaks, the git log and PR history will show which change caused the issue, making debugging easier.

Question 20. Your manager asks you to present a quarterly analysis to executives next Monday. You want to make sure the analysis is reproducible, but you only have one day to set up proper practices.
Given limited time, which THREE reproducibility practices from this chapter would you prioritize and implement first? Explain why each is the highest-leverage action.
Answer
With only one day, I would prioritize:

**1. Create a requirements.txt with pinned versions.** This is the single highest-leverage reproducibility action. If the analysis breaks next quarter because a library was updated, the requirements.txt lets someone recreate the exact environment. It takes less than 5 minutes: `pip freeze > requirements.txt`. Without it, you may never be able to reproduce the results.

**2. Set random seeds at the top of every notebook.** This takes approximately 30 seconds per notebook (`np.random.seed(42)`) and eliminates an entire class of reproducibility failures — different results from random operations. Without seeds, even running the same code on the same machine can produce different numbers.

**3. Write a minimal README.** Even a 10-line README documenting the data source, the steps to run the analysis, and any assumptions is vastly better than nothing. When the manager asks someone else to update the analysis next quarter, the README is the difference between a 2-hour task and a 2-day reverse-engineering project.

**Why not git?** Git is extremely important but takes more time to set up and learn if you have never used it. In a one-day scenario, the three items above provide the most reproducibility value per minute invested. Git should be the next priority after these three are in place.

Scoring Guide
| Section | Points |
|---|---|
| Multiple Choice (10 x 4) | 40 |
| True/False (4 x 4) | 16 |
| Short Answer (4 x 6) | 24 |
| Applied Scenarios (2 x 8) | 16 |
| Total | 96 |
Note: The remaining 4 points are reserved for exceptional depth in short answer or scenario responses. Passing score: 70/100.
The tools in this chapter are not just for passing quizzes — they are for building a sustainable data science practice. The real test comes the next time you start a project: will you git init first? Will you create a virtual environment? Will you write a README? If the answer is yes, this chapter has done its job.