Bibliography
How this bibliography is organized: Sources are grouped into three tiers reflecting our confidence in their accuracy and the way they are used in the text. Tier 1 contains fully verified, published works with complete citations. Tier 2 contains sources for specific attributed claims. Tier 3 provides context for illustrative examples. A recommended reading list organized by topic follows.
Tier 1: Primary References
These are published books and papers cited throughout the text. All are real, verified publications with complete bibliographic information.
Foundational Data Science and Statistics
Diez, D. M., Çetinkaya-Rundel, M., & Barr, C. D. (2019). OpenIntro Statistics (4th ed.). OpenIntro, Inc. https://www.openintro.org/book/os/
An open-source introductory statistics textbook. Referenced in Part IV (Statistical Thinking) for its clear treatment of hypothesis testing, confidence intervals, and probability.
James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer. https://www.statlearning.com/
The standard introductory machine learning textbook. Referenced in Part V (First Models) for its treatment of the bias-variance tradeoff, linear regression, logistic regression, and decision trees.
McKinney, W. (2022). Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter (3rd ed.). O'Reilly Media.
The definitive guide to pandas by the library's creator. Referenced extensively in Part II (Data Wrangling) for DataFrame operations, data cleaning, and reshaping.
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1--23. https://doi.org/10.18637/jss.v059.i10
The paper that formalized the concept of tidy data (one observation per row, one variable per column). Referenced in Chapter 9 (Reshaping and Transforming Data).
VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media. https://jakevdp.github.io/PythonDataScienceHandbook/
A comprehensive guide to the Python data science stack. Referenced throughout for NumPy, pandas, matplotlib, and scikit-learn usage patterns.
Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python (2nd ed.). O'Reilly Media.
A practitioner-oriented statistics reference. Referenced in Part IV for its accessible explanations of statistical tests and sampling methods.
Visualization
Tufte, E. R. (2001). The Visual Display of Quantitative Information (2nd ed.). Graphics Press.
The classic work on data visualization design. Referenced in Chapter 18 (Visualization Design) for principles of data-ink ratio, chart integrity, and graphical excellence.
Wilkinson, L. (2005). The Grammar of Graphics (2nd ed.). Springer.
The theoretical foundation for the grammar of graphics framework. Referenced in Chapter 14 for the conceptual model of data, aesthetics, geometries, scales, coordinates, and facets.
Cairo, A. (2016). The Truthful Art: Data, Charts, and Maps for Communication. New Riders.
A guide to honest and effective data visualization. Referenced in Chapter 18 for discussions of misleading charts and ethical visualization design.
Knaflic, C. N. (2015). Storytelling with Data: A Data Visualization Guide for Business Professionals. Wiley.
A practical guide to communicating with data. Referenced in Chapter 31 (Communicating Results) for its framework of context, audience, and narrative.
Python Programming
Matthes, E. (2023). Python Crash Course: A Hands-On, Project-Based Introduction to Programming (3rd ed.). No Starch Press.
An introductory Python textbook. Referenced in Part I (Welcome to Data Science) for foundational Python concepts.
Sweigart, A. (2019). Automate the Boring Stuff with Python: Practical Programming for Total Beginners (2nd ed.). No Starch Press.
A practical Python guide focused on automation. Referenced for file I/O patterns and web scraping basics.
Machine Learning
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.). O'Reilly Media.
A comprehensive machine learning guide. Referenced in Part V for its treatment of the machine learning workflow, model evaluation, and scikit-learn pipelines.
Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media.
A scikit-learn-focused machine learning introduction. Referenced in Chapters 25--30 for practical modeling patterns.
Ethics and Communication
O'Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Books.
An investigation of how algorithmic decision-making can perpetuate bias. Referenced in Chapter 32 (Ethics in Data Science) for case studies of biased algorithms in criminal justice, hiring, and lending.
D'Ignazio, C., & Klein, L. F. (2020). Data Feminism. MIT Press. https://data-feminism.mitpress.mit.edu/
A framework for thinking about power, bias, and justice in data science. Referenced in Chapter 32 for its analysis of how data collection practices reflect and reinforce social inequalities.
Criado Perez, C. (2019). Invisible Women: Data Bias in a World Designed for Men. Abrams Press.
An examination of how the absence of gender-disaggregated data leads to policies and products that disadvantage women. Referenced in Chapter 32.
Reproducibility and Professional Practice
Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), e1005510. https://doi.org/10.1371/journal.pcbi.1005510
Practical guidelines for computational reproducibility. Referenced in Chapter 33 (Reproducibility and Collaboration) for its recommendations on file organization, version control, and documentation.
Chacon, S., & Straub, B. (2014). Pro Git (2nd ed.). Apress. https://git-scm.com/book/en/v2
The comprehensive guide to Git. Referenced in Chapter 33 for version control concepts and workflows.
Tier 2: Attributed Claims and Specific References
These sources support specific factual claims made in the text.
Davenport, T. H., & Patil, D. J. (2012, October). Data scientist: The sexiest job of the 21st century. Harvard Business Review. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
The article that popularized the term "data scientist" in mainstream business culture. Referenced in Chapter 1 for the history of the field.
Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33(1), 1--67.
A foundational paper arguing that statistics should be more empirical and computational. Referenced in Chapter 1 for the intellectual origins of data science.
Cleveland, W. S. (2001). Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1), 21--26.
An early articulation of data science as a discipline distinct from traditional statistics. Referenced in Chapter 1.
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199--231.
A landmark paper distinguishing between data modeling (traditional statistics) and algorithmic modeling (machine learning). Referenced in Chapter 25 for the distinction between explanation and prediction.
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17--21.
The paper introducing Anscombe's quartet: four datasets with nearly identical summary statistics but very different distributions. Referenced in Chapter 19 to illustrate why visualization matters alongside numerical summaries.
Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing (3rd ed.). Academic Press.
Referenced in Chapter 19 for the discussion of robust statistics (median, IQR) vs. non-robust measures (mean, standard deviation).
Freedman, D. A. (2009). Statistical Models: Theory and Practice (2nd ed.). Cambridge University Press.
Referenced in Chapter 24 for its careful treatment of the distinction between correlation and causation, and the challenges of observational studies.
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., ... & Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357--362. https://doi.org/10.1038/s41586-020-2649-2
The primary citation for the NumPy library. Referenced in Chapter 7 for the scientific computing foundations underlying pandas.
McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 56--61.
The original paper introducing the pandas library. Referenced in Chapter 7.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825--2830.
The primary citation for the scikit-learn library. Referenced in Chapter 25.
Tier 3: Illustrative Context
These sources provide background for examples, case studies, and historical context used in the text. They are not directly cited but inform the narrative.
- The WHO Global Health Observatory data portal (https://www.who.int/data/gho) provides the primary dataset for Elena's progressive project on vaccination rates and global health indicators.
- The U.S. Census Bureau's American Community Survey provides the demographic data framework discussed in data collection chapters.
- Basketball Reference (https://www.basketball-reference.com/) provides the NBA statistics used in Priya's sports analytics examples.
- FiveThirtyEight's open data repository (https://github.com/fivethirtyeight/data) provides several datasets used in exercises throughout the book.
- The UCI Machine Learning Repository (https://archive.ics.uci.edu/) provides benchmark datasets referenced in modeling exercises.
Recommended Further Reading
Organized by topic for readers who want to go deeper after completing this book.
Python Programming (Beyond the Basics)
- Ramalho, L. (2022). Fluent Python (2nd ed.). O'Reilly Media. Deep dive into Python's object model, data structures, and design patterns.
- Beazley, D. M., & Jones, B. K. (2013). Python Cookbook (3rd ed.). O'Reilly Media. Practical recipes for common Python programming tasks.
Statistics and Probability
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. A more rigorous treatment of statistical theory for readers comfortable with mathematics.
- Downey, A. B. (2014). Think Stats: Exploratory Data Analysis in Python (2nd ed.). O'Reilly Media. https://greenteapress.com/thinkstats2/ A computation-first approach to statistics using Python.
- Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. W. H. Freeman. An accessible history of statistical ideas for readers who enjoy narrative nonfiction.
Data Visualization (Advanced)
- Munzner, T. (2014). Visualization Analysis and Design. CRC Press. A systematic framework for choosing and designing visualizations based on data type and task.
- Few, S. (2012). Show Me the Numbers: Designing Tables and Graphs to Enlighten (2nd ed.). Analytics Press. Practical guidelines for presenting quantitative information clearly.
Machine Learning (Next Steps)
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. https://hastie.su.domains/ElemStatLearn/ The graduate-level companion to An Introduction to Statistical Learning. More mathematical, more comprehensive.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. A thorough treatment of machine learning from a probabilistic perspective.
Data Engineering and SQL
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. An excellent guide to the systems that store and process data at scale.
- Molinaro, A., & de Graaf, R. (2020). SQL Cookbook (2nd ed.). O'Reilly Media. Practical SQL recipes for common data tasks.
Ethics, Fairness, and Society
- Eubanks, V. (2018). Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin's Press. Case studies of how automated systems harm vulnerable populations.
- Benjamin, R. (2019). Race After Technology: Abolitionist Tools for the New Jim Code. Polity. An examination of how technology can reinforce racial hierarchies.
- Zuboff, S. (2019). The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power. PublicAffairs. A broad analysis of how personal data is collected and monetized.
Career Development
- Robinson, E., & Nolis, J. (2020). Build a Career in Data Science. Manning Publications. Practical career advice from two experienced data scientists, covering job searching, interviewing, and professional development.
A note on citation practices: Throughout this textbook, we have aimed to cite only sources we are confident exist and that readers can locate. In a field that moves as quickly as data science, some URLs may change over time. If a link is broken, searching for the title and authors will usually locate the current version. For library-specific documentation, the official documentation sites (pandas.pydata.org, scikit-learn.org, matplotlib.org, seaborn.pydata.org) are the most current and authoritative references.