Further Reading: Reproducibility and Collaboration: Git, Environments, and Working with Teams
The tools and practices in this chapter are foundational for professional data science. The resources below range from practical git tutorials to deeper explorations of the reproducibility crisis and its implications.
Tier 1: Verified Sources
Scott Chacon and Ben Straub, Pro Git (Apress, 2nd edition, 2014). The definitive guide to git. The entire book is available free online at git-scm.com/book. It covers everything from basic commands to advanced topics like rebasing, cherry-picking, and git internals. Start with Chapters 1-3 for a solid foundation, then explore advanced topics as needed. This is the reference you will return to throughout your career.
Jake VanderPlas, A Whirlwind Tour of Python (O'Reilly, 2016) and the associated Python Data Science Handbook (O'Reilly, 2016). While not specifically about reproducibility, VanderPlas's handbook demonstrates excellent practices in how to structure data science notebooks, manage dependencies, and write reproducible analyses. The Jupyter notebooks for the handbook are publicly available on GitHub and serve as models of what narrative notebooks should look like.
Wilson et al., "Good Enough Practices in Scientific Computing," PLOS Computational Biology 13(6), 2017. A landmark paper that distills reproducible research practices into a practical checklist for scientists. The paper covers data management, software, collaboration, project organization, and manuscript preparation. The recommendations are deliberately "good enough" — not perfect, but vastly better than common practice. If you read one academic paper on reproducibility, make it this one.
Wilson et al., "Best Practices for Scientific Computing," PLOS Biology 12(1), 2014. The predecessor to the "Good Enough" paper, this one aims higher — outlining best practices for researchers who are ready to invest more in computational reproducibility. Covers version control, testing, code review, and documentation.
National Academies of Sciences, Engineering, and Medicine, Reproducibility and Replicability in Science (National Academies Press, 2019). A comprehensive report on the reproducibility crisis from the U.S. National Academies. It defines terms (reproducibility vs. replicability), surveys the evidence across fields, and makes recommendations for researchers, institutions, and funders. Dense but authoritative.
Begley & Ellis, "Drug development: Raise standards for preclinical cancer research," Nature 483:531-533, 2012. The Amgen study discussed in Case Study 1. A short, readable paper that catalyzed the reproducibility debate in biomedical research by reporting that Amgen scientists could confirm the findings of only 6 of 53 landmark cancer studies (about 11%).
Tier 2: Attributed Resources
GitHub's "Git Handbook" and interactive tutorials. GitHub provides excellent introductory resources for learning git, including an interactive tutorial that walks you through the basics in a web browser. Search for "GitHub git handbook" or "GitHub Skills" (the successor to the retired GitHub Learning Lab). These are ideal for beginners who want to learn by doing.
The Carpentries: "Version Control with Git." Software Carpentry (part of The Carpentries) offers a free, self-paced tutorial on git designed specifically for researchers and data scientists. It covers the basics clearly and includes exercises. Search for "Software Carpentry git lesson."
Atlassian Git Tutorials. Atlassian (makers of Bitbucket) maintains a comprehensive set of git tutorials covering workflows, branching strategies, and advanced topics. Their "Git Workflow" comparison (Centralized, Feature Branch, Gitflow, Forking) is particularly useful for understanding how teams use git differently. Search for "Atlassian git tutorials."
Real Python: "Python Virtual Environments: A Primer." A clear, practical guide to creating and managing virtual environments in Python using both venv and conda. Includes explanations of why virtual environments matter and how they work internally. Search for "Real Python virtual environments."
The Open Science Collaboration, "Estimating the reproducibility of psychological science," Science 349(6251), 2015. The landmark paper from the Reproducibility Project: Psychology that attempted to replicate 100 published studies. A sobering and important read for anyone who uses data to draw conclusions about the world.
Keith Baggerly and Kevin Coombes, "Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology," Annals of Applied Statistics 3(4):1309-1334, 2009. The paper documenting the Duke clinical trials reproducibility failures discussed in Case Study 1. Technical but accessible, and a powerful argument for computational reproducibility.
Jenny Bryan, "Excuse Me, Do You Have a Moment to Talk About Version Control?" The American Statistician 72(1):20-27, 2018. A persuasive and accessible paper aimed at convincing data analysts and statisticians to adopt git. Bryan addresses common objections ("it's too hard," "I work alone") and explains why version control is essential even for solo projects.
Recommended Next Steps
- If you are new to git: Start with GitHub's interactive tutorial, then read Chapters 1-3 of Pro Git. Practice by version-controlling a small personal project. The command line is the best way to learn; GUI tools hide too much of what is happening.
- If you want to improve your team workflow: Read the Atlassian Git Workflow comparison to understand different branching strategies (Centralized, Feature Branch, Gitflow, Forking), and look into Trunk-Based Development as a further alternative. The Feature Branch workflow described in this chapter is the most common for small teams, but larger teams may benefit from more structured approaches.
- If you are interested in the reproducibility crisis: Start with the Wilson et al. "Good Enough Practices" paper for practical recommendations, then read the Begley & Ellis and Open Science Collaboration papers for the motivating evidence. The National Academies report provides the most comprehensive overview.
- If you want to go beyond requirements.txt: Explore Docker for containerized reproducibility (useful for deployment and for analyses with complex system dependencies). For data versioning, look into DVC (Data Version Control), which works alongside git to track data files and model artifacts.
- If you want to learn about Continuous Integration (CI): CI tools like GitHub Actions automatically run tests and checks every time code is pushed. Combined with a rule that changes are merged only when those checks pass, this helps keep the main branch in a working state. GitHub Actions has a free tier and many pre-built templates for Python projects.
- If you want to see reproducibility in action: Browse the GitHub repositories of well-maintained open-source data science projects like scikit-learn, pandas, or seaborn. Notice their directory structure, CONTRIBUTING.md files, test suites, and CI configurations. These represent the gold standard of reproducible, collaborative development.
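Before moving beyond requirements.txt, it helps to see how little is involved in recording exact package versions. The sketch below uses only the standard library's importlib.metadata to list every installed distribution as a pinned "name==version" line. The function name pinned_requirements is hypothetical, and the output only approximates what pip freeze reports (it does not handle editable installs, for example), so treat it as an illustration rather than a replacement for pip or conda.

```python
from __future__ import annotations

from importlib import metadata


def pinned_requirements() -> list[str]:
    """Return 'name==version' lines for the installed distributions.

    Hypothetical helper for illustration; it approximates `pip freeze`
    but is not a drop-in replacement for it.
    """
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    # Sort case-insensitively so the file diffs cleanly under git.
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Print the pins; redirect to a file to save them alongside the project.
    print("\n".join(pinned_requirements()))
```

Run inside an activated virtual environment, the script prints only that environment's packages. Committing the saved output with the analysis gives a collaborator a record of the versions that were actually in use.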
A Final Thought
The tools in this chapter — git, virtual environments, documentation — are not data science tools. They are general-purpose tools that happen to be essential for data science. They were developed by software engineers, refined over decades, and adopted by data scientists because the alternative (chaos) is too costly.
Learning these tools is an investment. The first few weeks of using git feel slower than just saving files. The first requirements.txt feels like unnecessary overhead. The first README feels like writing that nobody will read.
But there will come a moment — maybe when a colleague clones your repository and gets the analysis running in five minutes, or when you go back to a six-month-old project and it still works, or when a code reviewer catches a bug that would have taken you days to find — when you realize that the investment has paid for itself many times over.
That moment is coming. The tools are waiting.