Appendix D: Data Sources Guide
How to use this appendix: Throughout this book, we encourage you to practice with real data. This appendix is a curated directory of free, publicly available datasets organized by topic. Each entry includes what the data contains, how to access it, and what kinds of questions you could explore. Bookmark this page --- you will return to it whenever you need a fresh dataset for a project or exercise.
D.1 Multi-Topic Data Repositories
These platforms host datasets across many subjects. They are excellent starting points when you are not sure what to analyze.
Kaggle Datasets
- URL: kaggle.com/datasets
- What it is: A community-driven platform with over 200,000 datasets uploaded by users. Datasets range from tiny teaching examples to massive real-world collections. Many come with starter notebooks showing exploratory analysis.
- How to access: Create a free Kaggle account. Download datasets directly from the website, or use the Kaggle API:
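```bash
# Install the Kaggle command-line client
pip install kaggle
# Download a dataset as a zip file (requires an API token from your Kaggle account settings)
kaggle datasets download -d dataset-owner/dataset-name
```
- Best for: Practice projects, competitions, and exploring diverse topics. Quality varies --- always inspect the data before trusting it.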
- Recommended datasets for beginners:
  - Titanic survival data (classification practice)
  - House prices (regression practice)
  - Iris flower dataset (clustering and classification)
  - New York City Airbnb listings (exploratory analysis)
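The Kaggle command-line download arrives as a zip archive, so a quick look inside is a sensible first step. The sketch below is one possible way to list the CSV files in an archive using only the standard library. The function name `list_csv_members` and the file names are illustrative, and the archive is built in memory so the example runs without a real download.

```python
import io
import zipfile

def list_csv_members(zip_source):
    """Return the names of CSV files inside a zip archive (path or file-like object)."""
    with zipfile.ZipFile(zip_source) as archive:
        return [name for name in archive.namelist() if name.lower().endswith(".csv")]

# Build a tiny in-memory archive so the sketch runs without a real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", "id,label\n1,0\n")
    archive.writestr("README.txt", "notes")
buffer.seek(0)

print(list_csv_members(buffer))
```

For a real download, you would pass the path of the zip file instead of `buffer`, then open the member you want with `archive.open(name)` and hand it to your CSV reader.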
UCI Machine Learning Repository
- URL: archive.ics.uci.edu
- What it is: One of the oldest and most respected sources of benchmark datasets for machine learning research. Maintained by the University of California, Irvine. Datasets are well-documented with descriptions of variables, data types, and original research context.
- How to access: Browse by task type (classification, regression, clustering) or by subject area. Download CSV or data files directly.
- Best for: Structured exercises in modeling (Chapters 25--30). The documentation is excellent for teaching.
- Classic datasets: Wine quality, Heart disease, Adult income, Breast cancer Wisconsin, Auto MPG.
Google Dataset Search
- URL: datasetsearch.research.google.com
- What it is: A search engine specifically for datasets. Indexes datasets from thousands of repositories worldwide.
- How to access: Search by keyword, then follow links to the original data source.
- Best for: Finding datasets on specific topics when other sources come up short.
D.2 Health and Public Health
World Health Organization (WHO) Global Health Observatory
- URL: who.int/data/gho
- What it is: Health statistics for 194 WHO member states, covering topics like life expectancy, disease prevalence, immunization coverage, health workforce, and environmental health indicators. This is the primary data source for Elena's progressive project throughout the book.
- How to access: Browse the online data portal and download CSV files. The WHO also offers an API (the GHO OData API) for programmatic access.
- Best for: Chapters 6--13 (progressive project), global health comparisons, time-series analysis of health trends.
- Example questions: How do vaccination rates vary by WHO region? Is there a relationship between healthcare spending and life expectancy?
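If you want to try the GHO OData API, a minimal sketch of building a query and flattening a response might look like the following. Several details here are assumptions to verify against WHO's own documentation: the base URL, the indicator code `WHOSIS_000001` (assumed to be life expectancy at birth), and the field names `SpatialDim`, `TimeDim`, and `NumericValue`. The sample payload uses invented placeholder values only to show the expected shape.

```python
from urllib.parse import quote

# Assumption: base URL of the GHO OData API; confirm it in WHO's documentation.
GHO_BASE = "https://ghoapi.azureedge.net/api"

def gho_url(indicator, country=None):
    """Build a query URL for one indicator, optionally filtered to one country."""
    url = f"{GHO_BASE}/{indicator}"
    if country:
        # OData filter syntax; SpatialDim is assumed to hold the ISO3 country code.
        url += "?$filter=" + quote(f"SpatialDim eq '{country}'")
    return url

def gho_rows(payload):
    """Flatten an OData response (a dict with a 'value' list) into simple dicts."""
    return [
        {
            "country": record.get("SpatialDim"),
            "year": record.get("TimeDim"),
            "value": record.get("NumericValue"),
        }
        for record in payload.get("value", [])
    ]

# Placeholder payload with made-up values, only to show the expected shape.
sample = {"value": [{"SpatialDim": "AAA", "TimeDim": 2000, "NumericValue": 1.0}]}

print(gho_url("WHOSIS_000001", country="JPN"))
print(gho_rows(sample))
```

The list of dictionaries returned by `gho_rows` can be passed straight to `pd.DataFrame(...)` once you fetch a real response.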
U.S. Centers for Disease Control and Prevention (CDC)
- URL: data.cdc.gov
- What it is: U.S. public health data including disease surveillance, vaccination records, mortality data, behavioral risk factors, and environmental health indicators.
- How to access: Search the data catalog and download CSV files. Many datasets are available through the Socrata Open Data API (SODA).
- Best for: U.S.-focused health analysis, time-series data, geographic comparisons across states.
- Highlighted datasets:
  - COVID-19 case surveillance (millions of rows --- good for large-data practice)
  - BRFSS (Behavioral Risk Factor Surveillance System) --- annual survey of 400,000+ adults
  - WONDER (Wide-ranging ONline Data for Epidemiologic Research) --- mortality and population data
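Because some CDC datasets run to millions of rows, it often helps to ask the SODA endpoint for a filtered slice rather than downloading everything. The sketch below builds such a query URL with the SoQL parameters `$limit`, `$offset`, and `$where`. The dataset ID `abcd-1234` and the column name `state` are hypothetical placeholders; look up the real ID and column names on the dataset's page.

```python
from urllib.parse import urlencode

def soda_csv_url(domain, dataset_id, where=None, limit=1000, offset=0):
    """Build a Socrata (SODA) CSV query URL using SoQL parameters."""
    params = {"$limit": limit, "$offset": offset}
    if where:
        params["$where"] = where
    # safe="$" keeps the SoQL parameter names readable in the final URL.
    return f"https://{domain}/resource/{dataset_id}.csv?{urlencode(params, safe='$')}"

# "abcd-1234" is a hypothetical dataset ID and "state" a hypothetical column.
url = soda_csv_url("data.cdc.gov", "abcd-1234", where="state = 'CA'", limit=500)
print(url)
# With a real dataset ID, pd.read_csv(url) would load the filtered rows.
```

Raising `offset` in steps of `limit` lets you page through a large dataset one chunk at a time.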
Our World in Data
- URL: ourworldindata.org
- What it is: Research and data on global development topics --- health, poverty, education, energy, environment. All charts are interactive with downloadable data.
- How to access: Click "Download" on any chart to get a CSV, or access their GitHub repository at github.com/owid for complete datasets.
- Best for: Clean, well-documented datasets ready for visualization practice. Excellent for comparative country-level analysis.
D.3 Economics and Business
World Bank Open Data
- URL: data.worldbank.org
- What it is: Economic and development indicators for every country: GDP, poverty rates, education, infrastructure, trade. Over 16,000 indicators with data going back decades.
- How to access: Browse indicators online and download CSV/Excel. The World Bank also offers a Python library:
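```python
# Install once with: pip install wbgapi
import wbgapi as wb

# GDP per capita (current US$) for all economies, returned as a pandas DataFrame
gdp = wb.data.DataFrame("NY.GDP.PCAP.CD", economy="all")
```
- Best for: Country-level economic analysis, merging economic data with health or education data.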
- Example questions: Is GDP per capita correlated with vaccination rates? How has poverty changed over the last 20 years?
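Merging economic data with health data usually comes down to joining on a country code and a year. The sketch below shows one way to do that in pandas and to check how many rows actually matched. The column names (`iso3`, `year`, `gdp_per_capita`, `vaccination_rate`) and every value are invented placeholders; real downloads will need reshaping into this long format first.

```python
import pandas as pd

# Placeholder values, invented purely to show the mechanics.
gdp = pd.DataFrame({
    "iso3": ["AAA", "BBB", "CCC"],
    "year": [2020, 2020, 2020],
    "gdp_per_capita": [1000.0, 5000.0, 20000.0],
})
health = pd.DataFrame({
    "iso3": ["AAA", "BBB", "DDD"],
    "year": [2020, 2020, 2020],
    "vaccination_rate": [60.0, 80.0, 90.0],
})

# An outer join keeps unmatched rows; indicator=True records where each row came from.
merged = gdp.merge(health, on=["iso3", "year"], how="outer", indicator=True)
print(merged["_merge"].value_counts())

# Keep only countries present in both tables before computing any statistics.
both = merged[merged["_merge"] == "both"]
print(len(both), "matched rows")
```

Inspecting the `_merge` column before dropping anything tells you which countries fell out of the analysis, which is worth reporting alongside your results.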
U.S. Bureau of Labor Statistics (BLS)
- URL: bls.gov/data
- What it is: Employment, unemployment, wages, prices (CPI), productivity, and workplace safety data for the United States.
- How to access: Download from the website or use the BLS API (free, requires registration for a key).
- Best for: Labor market analysis, inflation studies, wage comparisons.
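As a rough illustration of the BLS API, the sketch below prepares a request without sending it, so you can inspect exactly what would go over the wire. The endpoint URL (assumed to be version 2 of the public API), the payload field names, and the series ID `CUUR0000SA0` (assumed to be CPI-U, all items) should all be checked against the BLS API documentation. The function name `bls_request` is illustrative.

```python
import json
from urllib.request import Request

# Assumption: the version 2 endpoint of the BLS Public Data API.
BLS_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"

def bls_request(series_ids, start_year, end_year, api_key=None):
    """Prepare (but do not send) a POST request for one or more BLS series."""
    payload = {
        "seriesid": list(series_ids),
        "startyear": str(start_year),
        "endyear": str(end_year),
    }
    if api_key:
        payload["registrationkey"] = api_key
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(BLS_URL, data=body, headers=headers, method="POST")

# "CUUR0000SA0" is assumed to be the series ID for CPI-U, all items.
req = bls_request(["CUUR0000SA0"], 2020, 2023)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To send it: urllib.request.urlopen(req), then json.load() on the response.
```

Keeping the request-building step separate from the network call makes it easy to test your code and to avoid spending your daily query allowance while debugging.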
Federal Reserve Economic Data (FRED)
- URL: fred.stlouisfed.org
- What it is: Over 800,000 economic time series from dozens of sources: interest rates, GDP, employment, housing, trade. Maintained by the Federal Reserve Bank of St. Louis.
- How to access: Search and download directly, or use the `fredapi` Python package.
- Best for: Time-series analysis (Chapter 11), economic modeling, macroeconomic research.
D.4 Government and Demographics
U.S. Census Bureau
- URL: data.census.gov
- What it is: Demographic, economic, and housing data for the United States at every geographic level --- national, state, county, zip code, and census tract.
- How to access: Use the data explorer on the website, or access via the Census API (free, requires a key from api.census.gov).
- Best for: Geographic analysis, demographic comparisons, merging census data with other datasets.
- Key datasets: American Community Survey (ACS), Decennial Census, Current Population Survey.
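A Census API query is a URL naming the year, the dataset, the variables you want, and the geography. The sketch below builds such a URL and converts the list-of-lists response into dictionaries. The variable code `B01003_001E` is assumed to be the ACS total-population estimate, the helper names are illustrative, and the sample response contains invented values only to show the shape.

```python
from urllib.parse import urlencode

def census_url(year, dataset, variables, geography, key=None):
    """Build a Census API query URL, for example with dataset='acs/acs5'."""
    params = {"get": ",".join(variables), "for": geography}
    if key:
        params["key"] = key
    # Keep commas, colons, and asterisks readable instead of percent-encoding them.
    query = urlencode(params, safe=",:*")
    return f"https://api.census.gov/data/{year}/{dataset}?{query}"

def rows_to_dicts(rows):
    """Convert a response whose first row is the header into a list of dicts."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

# B01003_001E is assumed to be the ACS total-population estimate.
print(census_url(2022, "acs/acs5", ["NAME", "B01003_001E"], "state:*"))

# Placeholder response with invented values, only to show the shape.
sample = [["NAME", "B01003_001E", "state"], ["Example State", "12345", "99"]]
print(rows_to_dicts(sample))
```

Note that the values arrive as strings, so convert numeric columns (for example with `pd.to_numeric`) before doing arithmetic.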
data.gov
- URL: data.gov
- What it is: The U.S. government's open data portal, aggregating datasets from hundreds of federal agencies. Over 300,000 datasets covering agriculture, climate, education, energy, health, public safety, and more.
- How to access: Search by topic or agency. Most datasets are downloadable as CSV, JSON, or through APIs.
- Best for: Finding U.S. government data on almost any topic. Quality and documentation vary widely.
European Union Open Data Portal
- URL: data.europa.eu
- What it is: Open data from EU institutions and member states. Covers agriculture, economy, environment, health, transportation, and more.
- How to access: Search and download directly. Many datasets are available in CSV and JSON.
D.5 Sports
Basketball Reference / Sports Reference
- URL: basketball-reference.com, baseball-reference.com, pro-football-reference.com
- What it is: Comprehensive statistics for professional sports. Player stats, team records, game logs, and historical data going back decades. This is where Priya gets her NBA data.
- How to access: Browse tables on the website and copy/paste or use the "Share & Export" option for CSV. For larger datasets, use the `basketball_reference_web_scraper` Python package or scrape responsibly (see Chapter 13).
- Best for: Sports analytics, time-series analysis of player performance, comparative analysis across eras.
- Example questions: Has three-point shooting really changed the NBA? Which stat best predicts team wins?
FiveThirtyEight Data
- URL: github.com/fivethirtyeight/data
- What it is: The datasets behind FiveThirtyEight's data journalism articles. Topics include politics, sports, economics, science, and culture. Each dataset comes with the article that used it, so you can compare your analysis to theirs.
- How to access: Clone or download from GitHub. Datasets are typically clean CSV files.
- Best for: Guided exploration (read the article, then try to reproduce or extend the analysis). Excellent for building portfolio projects (Chapter 34).
- Popular datasets: NFL Elo ratings, political polling averages, comic book character demographics, Bob Ross painting elements.
D.6 Social Science and Education
IPUMS (Integrated Public Use Microdata Series)
- URL: ipums.org
- What it is: Harmonized census and survey data from around the world. Individual-level microdata from the U.S. Census, American Community Survey, Current Population Survey, and international censuses.
- How to access: Create a free account, select the variables and years you want, and download a customized extract.
- Best for: In-depth demographic research, longitudinal studies, social science analysis.
National Center for Education Statistics (NCES)
- URL: nces.ed.gov
- What it is: U.S. education data: school enrollments, graduation rates, test scores, teacher salaries, college costs, and financial aid.
- How to access: Use the data tools on the website to create custom tables, or download complete datasets.
- Best for: Education policy analysis and school comparisons, including Jordan's investigation into grading patterns.
General Social Survey (GSS)
- URL: gss.norc.org
- What it is: A nationally representative survey of American adults conducted since 1972. Covers attitudes on social issues, demographics, religion, politics, and quality of life.
- How to access: Download from the GSS Data Explorer. Available in multiple formats.
- Best for: Tracking social attitude changes over time, demographic analysis.
D.7 Environment and Climate
NOAA Climate Data Online
- URL: ncdc.noaa.gov/cdo-web
- What it is: Historical weather and climate data for the United States and the world. Temperature, precipitation, wind speed, and extreme weather events.
- How to access: Order custom datasets through the web interface (free, but may take minutes to process). Station-level data is also available through APIs.
- Best for: Time-series analysis, geographic patterns, climate trend visualization.
NASA Earth Data
- URL: earthdata.nasa.gov
- What it is: Satellite and ground-based observations of the Earth: atmosphere, oceans, land, ice, and more.
- How to access: Create a free Earthdata account. Data is available through various platforms and APIs.
- Best for: Advanced projects involving geospatial data, remote sensing, environmental science.
EPA Environmental Datasets
- URL: epa.gov/data
- What it is: Air quality, water quality, toxic releases, greenhouse gas emissions, and environmental health data for the United States.
- How to access: Download from the EPA website or use the EPA's API services.
- Best for: Environmental justice analysis, pollution tracking, regulatory compliance studies.
D.8 Tips for Working with New Datasets
Before diving into analysis, always follow this checklist:
- Read the documentation. What do the columns mean? How was the data collected? What time period does it cover? What is the unit of observation (one row = one person? one country-year? one transaction?)?
- Check the license. Most datasets listed here are free for educational and research use. Some have restrictions on commercial use or redistribution. Always check.
- Assess data quality. How many rows? How many missing values? Are there obvious errors? Run `df.info()`, `df.describe()`, and `df.isnull().sum()` before anything else.
- Understand the sampling. Is this a census (every member of the population) or a sample? If it is a sample, how were participants selected? Selection bias can invalidate your conclusions.
- Consider the context. Who collected this data and why? What might be missing? What populations might be underrepresented?
- Start small. If the dataset has millions of rows, work with a random sample of a few thousand while developing your analysis. Scale up once your code works.
- Cite your sources. When you use a dataset in a project or report, include the source, URL, date accessed, and any relevant documentation. This is a professional standard (and an ethical one).
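The quality check and the "start small" step from the checklist can be bundled into one helper. What follows is a minimal sketch, not a complete audit: the function name `first_look` is illustrative, and the tiny table is invented to stand in for a freshly downloaded dataset.

```python
import pandas as pd

def first_look(df, sample_size=1000, seed=0):
    """Summarize size, missing values, and duplicates; return a small random sample."""
    report = {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_by_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    # Never ask for more rows than exist; a fixed seed makes the sample repeatable.
    n = min(sample_size, len(df))
    return report, df.sample(n=n, random_state=seed)

# Tiny invented table standing in for a freshly downloaded dataset.
df = pd.DataFrame({
    "country": ["A", "B", "B", "C"],
    "year": [2020, 2021, 2021, 2020],
    "value": [1.5, 2.0, 2.0, None],
})

report, sample = first_look(df, sample_size=2)
print(report)
print(sample)
```

Develop your analysis on `sample`, then rerun it on the full DataFrame once the code works. A fixed `seed` means you get the same rows every time, which makes debugging far less confusing.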
D.9 Building Your Own Dataset
Sometimes the perfect dataset does not exist. In that case, you can create your own:
- Web scraping (Chapter 13): Extract data from web pages using BeautifulSoup or similar tools. Always check the site's terms of service and robots.txt.
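- APIs (Chapter 13): Many organizations offer APIs for programmatic data access, including Twitter (now X), Reddit, Spotify, the New York Times, and many government agencies. Access terms differ, and some services limit or charge for use, so check the current developer documentation.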
- Surveys: Tools like Google Forms or SurveyMonkey let you collect original data. But designing a good survey is harder than it looks --- consult resources on survey methodology first.
- Manual data entry: For small datasets, sometimes the fastest path is a spreadsheet. Just be consistent with formatting.
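Checking robots.txt before scraping can be automated with Python's standard library. The sketch below parses robots.txt text and asks whether a given URL may be fetched. The rules and the `example.com` URLs are invented so the example runs offline, and the helper name `allowed` is illustrative.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Invented robots.txt for a hypothetical site, to show the behavior offline.
example_rules = """
User-agent: *
Disallow: /private/
"""

print(allowed(example_rules, "https://example.com/stats/players.html"))
print(allowed(example_rules, "https://example.com/private/data.html"))
```

For a live site, you would instead call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()`. Keep in mind that robots.txt is only part of the picture: the terms of service may impose further limits.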
The datasets listed here are accurate as of the publication of this book. URLs and availability may change over time. If a link is broken, search for the dataset name --- the data has likely moved rather than disappeared.