Appendix C: Free Business Datasets for Practice
Python for Business for Beginners: Coding for Every Person
All sources listed here are free, publicly accessible, and suitable for practice and portfolio projects. URLs and access patterns are current as of this writing. Government data sources are the most stable; commercial platforms may change access terms.
How to Find and Download Data
Before diving into specific sources, here are the general patterns for accessing each type:
Direct download: The source provides a CSV, Excel, or JSON file you can download directly to your computer or read with pd.read_csv(url).
API access: The source provides a programmatic interface. You make HTTP requests and receive JSON data. Usually requires a free API key.
Data portal search: You browse a catalog, find a dataset, and download it. No account needed for most government portals.
Web scraping: Data is on a public webpage but not available for download. Use requests + BeautifulSoup (see Chapter 22). Always check robots.txt first.
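The robots.txt check mentioned above can be done with Python's standard library. The following is a minimal sketch, not a complete scraping policy: the robots.txt content, the bot name "my-practice-bot", and the example.com URLs are all hypothetical and are inlined so the example runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-practice-bot" is a made-up user agent name for illustration.
can_scrape_public = parser.can_fetch(
    "my-practice-bot", "https://example.com/reports/2024.html"
)
can_scrape_private = parser.can_fetch(
    "my-practice-bot", "https://example.com/private/data.html"
)
print(can_scrape_public, can_scrape_private)
```

Against a live site you would call parser.set_url() with the site's robots.txt address and then parser.read() instead of parsing an inline string.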
Government and Public Data
U.S. Census Bureau
URL: census.gov/data
API: api.census.gov (free key required, register at census.gov/developers)
What is available:
- American Community Survey (ACS): demographic, income, housing, education, employment data by geography (state, county, zip code, census tract)
- Decennial Census: population counts every 10 years
- Economic Census: business data by industry and geography (every 5 years)
- Business Patterns: employment and payroll by county and industry (annually)
Format: CSV download, API returns JSON
Typical use: Market sizing, customer demographic analysis, site selection, labor market research
Python access:
import requests
import pandas as pd

# Get median household income by state
url = "https://api.census.gov/data/2022/acs/acs5"
params = {
    "get": "NAME,B19013_001E",  # median household income
    "for": "state:*",
    "key": "YOUR_API_KEY",
}
response = requests.get(url, params=params)
response.raise_for_status()  # stop early on a bad key or malformed query
data = response.json()

# The first row holds the column names; every value arrives as a string
df = pd.DataFrame(data[1:], columns=data[0])
df = df.rename(columns={"B19013_001E": "median_income"})
df["median_income"] = pd.to_numeric(df["median_income"])
Licensing: Public domain. No restrictions on use.
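Census API responses need a little cleaning before analysis, because every value is a string and unavailable estimates are reported as large negative codes such as -666666666. The sketch below shows one way to handle both. The sample rows are made-up values shaped like an API response, and census_to_frame is a hypothetical helper name, not part of any library.

```python
import pandas as pd

# Made-up rows shaped like a Census API response: header row first,
# every value a string. The income figures are illustrative only.
sample = [
    ["NAME", "B19013_001E", "state"],
    ["Example State A", "60000", "01"],
    ["Example State B", "85000", "02"],
    ["Example Area C", "-666666666", "99"],
]


def census_to_frame(rows, value_col, new_name):
    """Build a DataFrame and convert one estimate column to numbers."""
    df = pd.DataFrame(rows[1:], columns=rows[0])
    df = df.rename(columns={value_col: new_name})
    values = pd.to_numeric(df[new_name], errors="coerce")
    # Negative values are annotation codes, not real estimates,
    # so replace them with NaN.
    df[new_name] = values.where(values >= 0)
    return df


income = census_to_frame(sample, "B19013_001E", "median_income")
print(income)
```

Treating every negative value as missing is an assumption that suits dollar amounts and counts; it would be wrong for a variable that can legitimately be negative.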
Bureau of Labor Statistics (BLS)
URL: bls.gov/data
API: api.bls.gov (free key recommended for higher request limits)
What is available:
- Consumer Price Index (CPI): inflation by category
- Employment Situation: monthly jobs report data
- Unemployment rate by state, metro area, industry
- Occupational Employment and Wage Statistics: wages by job title
- Producer Price Index (PPI): commodity and service prices
- Job Openings and Labor Turnover (JOLTS): hiring and quit rates
Format: JSON API, CSV download via website
Typical use: Economic context for business analysis, compensation benchmarking, industry trends
Python access:
import requests
import pandas as pd

# Get CPI data. The v2 endpoint used here takes a free registration key;
# the v1 endpoint accepts small requests without one.
url = "https://api.bls.gov/publicAPI/v2/timeseries/data/"
headers = {"Content-type": "application/json"}
data = {
    "seriesid": ["CUUR0000SA0"],  # All items CPI
    "startyear": "2020",
    "endyear": "2024",
    "registrationkey": "YOUR_KEY",
}
response = requests.post(url, json=data, headers=headers)
response.raise_for_status()
series = response.json()["Results"]["series"][0]["data"]
df = pd.DataFrame(series)[["year", "period", "value"]]
Licensing: Public domain.
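BLS returns the year and period as separate text fields (for example "2024" and "M01"), and the value as a string. Period "M13" marks an annual average rather than a month. The sketch below turns such records into a date-indexed monthly series. The records are made-up values in the shape of the API's "data" list, and bls_to_monthly is a hypothetical helper name.

```python
import pandas as pd

# Made-up records shaped like the "data" list in a BLS API response.
sample_series = [
    {"year": "2024", "period": "M02", "value": "310.0"},
    {"year": "2024", "period": "M01", "value": "308.0"},
    {"year": "2023", "period": "M13", "value": "305.0"},  # annual average
]


def bls_to_monthly(series):
    """Return a date-indexed Series of monthly values, oldest first."""
    df = pd.DataFrame(series)[["year", "period", "value"]]
    # Keep only true months (M01 to M12); drop annual averages (M13).
    df = df[df["period"].str.match(r"M(0[1-9]|1[0-2])$")].copy()
    df["date"] = pd.to_datetime(df["year"] + "-" + df["period"].str[1:] + "-01")
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    return df.set_index("date")["value"].sort_index()


cpi = bls_to_monthly(sample_series)
print(cpi)
```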
Bureau of Economic Analysis (BEA)
URL: bea.gov/data
API: apps.bea.gov/api (free key required)
What is available:
- GDP by state and metropolitan area
- Personal income by state
- Industry GDP and employment
- International trade data
Format: API returns JSON; some CSV downloads available
Typical use: Economic context for regional business analysis, industry sizing
Licensing: Public domain.
SEC EDGAR
URL: sec.gov/edgar
API: data.sec.gov/api
What is available:
- All public company filings: 10-K annual reports, 10-Q quarterly reports, 8-K current events
- Financial statements in XBRL (machine-readable format)
- Ownership filings, proxy statements, prospectuses
Format: XBRL/JSON via API, HTML and PDF via filing viewer
Typical use: Competitive analysis, financial benchmarking, company research
Python access:
import requests
import pandas as pd
# Get company facts (financial data)
# CIK is the company identifier (look up at SEC EDGAR website)
headers = {"User-Agent": "Your Name yourname@email.com"}
# Apple's CIK is 0000320193
url = "https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json"
response = requests.get(url, headers=headers)
data = response.json()
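# Extract a specific financial metric. Tag names vary by company and year,
# so if "Revenues" is missing, list data["facts"]["us-gaap"].keys() to find
# the revenue tag that filer uses.
revenues = data["facts"]["us-gaap"]["Revenues"]["units"]["USD"]
df = pd.DataFrame(revenues)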
Note: Always include a User-Agent header with your contact information as required by SEC EDGAR's fair access policy.
Licensing: Public domain.
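Two details of the company facts data tend to trip up first-time users: the CIK in the URL must be zero-padded to 10 digits, and the same period often appears more than once because each 10-K repeats prior-year figures. The sketch below handles both. The records are made-up values using field names that appear in company facts responses (start, end, val, form, filed); companyfacts_url and annual_values are hypothetical helper names.

```python
import pandas as pd


def companyfacts_url(cik):
    """Build the company facts URL with the CIK zero-padded to 10 digits."""
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{int(cik):010d}.json"


# Made-up records shaped like one metric's "USD" list.
sample_facts = [
    {"start": "2021-01-01", "end": "2021-12-31", "val": 100,
     "form": "10-K", "filed": "2022-02-01"},
    # The next 10-K repeats the 2021 figure as a comparative.
    {"start": "2021-01-01", "end": "2021-12-31", "val": 100,
     "form": "10-K", "filed": "2023-02-01"},
    {"start": "2022-01-01", "end": "2022-12-31", "val": 120,
     "form": "10-K", "filed": "2023-02-01"},
    {"start": "2022-01-01", "end": "2022-03-31", "val": 28,
     "form": "10-Q", "filed": "2022-05-01"},
]


def annual_values(facts):
    """Keep one full-year value per period, preferring the latest filing."""
    df = pd.DataFrame(facts)
    days = (pd.to_datetime(df["end"]) - pd.to_datetime(df["start"])).dt.days
    # Assumption: a period longer than 300 days is a full fiscal year.
    df = df[(df["form"] == "10-K") & (days > 300)]
    df = df.sort_values("filed").drop_duplicates(
        subset=["start", "end"], keep="last"
    )
    return df.set_index("end")["val"].sort_index()


annual = annual_values(sample_facts)
print(companyfacts_url("320193"))
print(annual)
```

This filter assumes a metric measured over a period, such as revenue. Balance sheet items are reported at a point in time and have no start field, so they need a different rule.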
World Bank Open Data
URL: data.worldbank.org
API: api.worldbank.org (no authentication required)
What is available:
- GDP, GDP per capita, GDP growth for every country
- Trade data, foreign direct investment
- Population, health, education indicators
- Business environment indicators (ease of doing business; note that the Doing Business report was discontinued in 2021)
- Hundreds of development indicators
Python package: wbgapi (pip install wbgapi)
import wbgapi as wb
import pandas as pd
# GDP per capita (current US$) for OECD countries
df = wb.data.DataFrame("NY.GDP.PCAP.CD", economy=wb.region.members("OED"))
Licensing: CC BY 4.0. Attribution required.
data.gov
URL: data.gov
What is available: The main portal for all US federal government open data. Thousands of datasets across agriculture, energy, finance, health, education, transportation, and more.
Typical use: Finding niche government datasets relevant to specific industries.
Search tip: Use the search filters by format (CSV is easiest to work with), agency, and category.
Licensing: Varies by dataset. Check the license field on each dataset's page.
Business and Competitive Datasets
Kaggle
URL: kaggle.com/datasets
What is available:
- Thousands of community-contributed datasets across every domain
- Retail sales, customer transactions, HR data, financial data, supply chain
- Machine learning competition datasets (many are realistic business problems)
- User-generated notebooks showing how others analyzed each dataset
Access: Free account required for download
Format: Usually CSV, sometimes JSON or SQLite
Quality warning: Quality varies enormously. Check the dataset's "usability score," the discussion section, and the data card before using. Many datasets have errors, inconsistencies, or unrealistic distributions.
Recommended starter datasets for business analysis:
- "Superstore Sales Dataset": retail orders with region, category, profitability
- "Online Retail Dataset": e-commerce transactions from a UK retailer (also on UCI)
- "HR Analytics": employee attrition data for churn analysis
- "Rossmann Store Sales": retail sales forecasting with real complexity
UCI Machine Learning Repository
URL: archive.ics.uci.edu
What is available: Curated datasets specifically designed for machine learning and statistical analysis. More rigorous quality control than Kaggle.
Recommended business datasets:
- Online Retail dataset: 500,000+ transactions from a UK gift wholesaler
- Bank Marketing dataset: Portuguese bank telemarketing campaign results
- Adult Income dataset: census income data for classification
Format: CSV, sometimes custom format with documentation
Financial Data
Federal Reserve Economic Data (FRED)
URL: fred.stlouisfed.org
Python package: fredapi (pip install fredapi, free API key from FRED)
What is available:
- 800,000+ economic time series
- Interest rates (Fed Funds, Treasury yields, LIBOR/SOFR)
- Inflation (CPI, PCE)
- Employment and labor market data
- Banking and financial indicators
- Regional economic data
Python access:
from fredapi import Fred
import pandas as pd
fred = Fred(api_key="YOUR_FREE_API_KEY")
# 10-Year Treasury yield
treasuries = fred.get_series("DGS10")
# CPI (all items)
cpi = fred.get_series("CPIAUCSL", observation_start="2019-01-01")
# Unemployment rate
unemployment = fred.get_series("UNRATE")
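Licensing: Series from U.S. government agencies are generally public domain, but FRED also carries series owned by third parties that may be copyrighted. Check the notes on each series page before redistributing.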
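A CPI index level is rarely useful on its own; most business analysis wants the year-over-year inflation rate. Since fredapi returns a date-indexed pandas Series, the conversion is one line. The sketch below uses made-up monthly index values in place of a live fred.get_series("CPIAUCSL") call so it runs without an API key.

```python
import pandas as pd

# Made-up monthly index levels standing in for a FRED CPI series.
dates = pd.date_range("2023-01-01", periods=13, freq="MS")
cpi = pd.Series([100.0 + i for i in range(12)] + [103.0], index=dates)

# Percent change versus the same month one year earlier.
yoy = cpi.pct_change(periods=12) * 100

# The first 12 months have no prior-year value, so they are NaN.
latest = round(yoy.iloc[-1], 2)
print(f"Latest year-over-year change: {latest}%")
```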
Yahoo Finance via yfinance
URL: finance.yahoo.com
Python package: yfinance (pip install yfinance)
What is available:
- Historical price data for stocks, ETFs, indices, and currencies
- Dividend and split history
- Company financial statements (income statement, balance sheet, cash flow)
- Options chains
- Company info and metrics
Python access:
import yfinance as yf
# Get stock price history
msft = yf.Ticker("MSFT")
hist = msft.history(period="2y") # 2 years of daily prices
# Get financial statements
income_stmt = msft.financials # income statement
balance_sheet = msft.balance_sheet
cash_flow = msft.cashflow
# Download multiple tickers
data = yf.download(["MSFT", "AAPL", "GOOGL"], start="2022-01-01")
Note: yfinance is a third-party library that scrapes Yahoo Finance and is not officially sanctioned by Yahoo. It works reliably for historical data but may break if Yahoo changes its site. For production financial applications, use a commercial data provider.
Licensing: Data is sourced from Yahoo Finance. Suitable for research and personal use; commercial use restrictions may apply.
Alpha Vantage
URL: alphavantage.co
Free tier: 25 API calls per day
What is available:
- Intraday and daily stock prices
- Forex exchange rates
- Cryptocurrency prices
- Economic indicators
- Company fundamentals
Python access:
import requests
import pandas as pd

api_key = "YOUR_FREE_KEY"
url = "https://www.alphavantage.co/query"
params = {
    "function": "TIME_SERIES_DAILY",
    "symbol": "MSFT",
    "outputsize": "compact",
    "apikey": api_key,
}
response = requests.get(url, params=params)
data = response.json()["Time Series (Daily)"]
df = pd.DataFrame(data).T  # transpose: dates as rows
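With only 25 calls per day on the free tier, it is worth handling the case where the response contains a message instead of prices, and converting the result to proper types. The sketch below is one way to do that, under the assumption that a response without price data carries its explanation under a key such as "Note", "Information", or "Error Message". The payload values are made up, and parse_daily is a hypothetical helper name.

```python
import pandas as pd

# Made-up payload shaped like a TIME_SERIES_DAILY response.
sample_payload = {
    "Time Series (Daily)": {
        "2024-01-03": {"1. open": "101.0", "2. high": "103.0",
                       "3. low": "100.0", "4. close": "102.0",
                       "5. volume": "1500"},
        "2024-01-02": {"1. open": "100.0", "2. high": "102.0",
                       "3. low": "99.0", "4. close": "101.0",
                       "5. volume": "1000"},
    }
}


def parse_daily(payload):
    """Return a date-indexed, numeric DataFrame or raise a clear error."""
    if "Time Series (Daily)" not in payload:
        # Assumed keys for rate-limit and error messages.
        message = (payload.get("Note") or payload.get("Information")
                   or payload.get("Error Message") or "unexpected response")
        raise ValueError(f"No price data returned: {message}")
    df = pd.DataFrame(payload["Time Series (Daily)"]).T
    # Turn "1. open" into "open", and so on.
    df.columns = [name.split(". ", 1)[-1] for name in df.columns]
    df.index = pd.to_datetime(df.index)
    return df.astype(float).sort_index()


prices = parse_daily(sample_payload)
print(prices)
```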
Geographic and Demographic Data
U.S. ZIP Code Data
URL: census.gov/geographies/reference-files
What is available: ZIP Code Tabulation Areas (ZCTAs) with demographic data from the ACS. Also available via the Census API.
Use cases: Customer geographic analysis, market segmentation, delivery zone analysis.
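ACS figures for small areas such as ZCTAs are survey estimates, and each estimate variable (suffix E, such as B19013_001E) has a matching margin-of-error variable (suffix M, such as B19013_001M) published at the 90 percent confidence level. The sketch below converts the margin of error into a coefficient of variation to flag unreliable estimates. The ZCTA codes and dollar figures are made up, and the 15 percent cutoff is an assumed rule of thumb, not an official threshold.

```python
import pandas as pd

Z_90 = 1.645  # z-score for the 90 percent confidence level used by ACS

# Made-up ZCTA rows: an estimate and its published margin of error.
df = pd.DataFrame({
    "zcta": ["00001", "00002"],
    "median_income": [60000, 45000],
    "median_income_moe": [3000, 15000],
})

# Standard error, then coefficient of variation as a percentage.
df["se"] = df["median_income_moe"] / Z_90
df["cv_pct"] = df["se"] / df["median_income"] * 100

# Assumed cutoff: treat estimates with CV under 15 percent as usable.
df["reliable"] = df["cv_pct"] < 15
print(df[["zcta", "median_income", "cv_pct", "reliable"]])
```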
OpenStreetMap Data
URL: openstreetmap.org, export via overpass-api.de
What is available: Location data for businesses, roads, boundaries, and amenities, covering the entire world. Much more granular than census geography.
Python package: osmnx (pip install osmnx) for network and place data.
Google Maps Platform / OpenCage
URL: developers.google.com/maps (free tier available); opencagedata.com
What is available: Geocoding (address to coordinates), reverse geocoding, distance matrices, places search.
Use case: Converting customer address lists to coordinates for mapping and distance analysis.
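Once addresses have been geocoded, straight-line distance can be computed without any further API calls using the haversine formula. The following is a minimal sketch: the warehouse and customer coordinates are hypothetical, and the result is great-circle distance, not driving distance.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Hypothetical coordinates, as a geocoder might return them.
warehouse = (40.0, -75.0)
customers = {"C001": (40.5, -75.0), "C002": (41.0, -74.0)}

distances = {
    cid: haversine_km(*warehouse, *coords) for cid, coords in customers.items()
}
nearest = min(distances, key=distances.get)
print(distances, nearest)
```

For delivery planning, road distance from a distance matrix service will be longer than these figures; the haversine result is best used as a quick first filter.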
E-Commerce and Retail Datasets
Olist Brazilian E-Commerce (Kaggle)
A real dataset from a Brazilian e-commerce platform with 100,000 orders, products, sellers, reviews, and logistics data. One of the best public datasets for realistic e-commerce analysis.
Instacart Market Basket Analysis (Kaggle)
3 million grocery orders from Instacart's app, including product sequences, reorder patterns, and time data. Good for recommendation system projects and retail analytics.
UCI Online Retail Dataset
500,000+ transactions from a UK-based online gift retailer. Includes invoice numbers, product descriptions, quantities, prices, and customer IDs. Suitable for customer segmentation (RFM), cohort analysis, and revenue analysis.
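import pandas as pd

# Direct URL access. The file is an Excel workbook, so use read_excel
# (requires the openpyxl package).
df = pd.read_excel(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "00352/Online%20Retail.xlsx",
    engine="openpyxl",
)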
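The customer segmentation (RFM) use mentioned above boils down to three numbers per customer: days since the last purchase, number of distinct invoices, and total spend. The sketch below computes them on a tiny made-up sample that borrows the dataset's column names (InvoiceNo, CustomerID, InvoiceDate, Quantity, UnitPrice). Using the day after the last invoice as the reference date is an assumption.

```python
import pandas as pd

# Made-up rows using the Online Retail dataset's column names.
orders = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "CustomerID": [1, 1, 1, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-03-01", "2011-02-10"]
    ),
    "Quantity": [2, 1, 5, 10],
    "UnitPrice": [3.0, 4.0, 2.0, 1.5],
})
orders["Revenue"] = orders["Quantity"] * orders["UnitPrice"]

# Reference date for recency: one day after the most recent invoice.
snapshot = orders["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("CustomerID").agg(
    recency_days=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    frequency=("InvoiceNo", "nunique"),
    monetary=("Revenue", "sum"),
)
print(rfm)
```

On the full dataset you would first drop rows with a missing CustomerID and decide how to treat negative quantities, which typically represent returns.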
HR and Employment Datasets
Bureau of Labor Statistics — Occupational Employment
See BLS entry above. The Occupational Employment and Wage Statistics (OEWS) program provides wages and employment by occupation for every US state and major metro area.
Kaggle HR Analytics Collection
Multiple HR datasets are available on Kaggle, including:
- IBM HR Analytics Employee Attrition (synthetic but realistic)
- Human Resources Data Set (Hewitt Associates style)
- Diversity and inclusion metrics datasets
LinkedIn Salary Insights (via LinkedIn API)
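LinkedIn provides salary data by job title and location through its Talent Insights product. Talent Insights is a paid product and LinkedIn's APIs are generally restricted to approved partners, so check the current terms before planning a project around programmatic access.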
Marketing and Customer Behavior Datasets
Google Analytics Sample Dataset (BigQuery)
Google provides a sample Google Analytics dataset in BigQuery. Free to query up to 1 TB per month. Covers website sessions, ecommerce transactions, and user behavior for the Google Merchandise Store.
HubSpot Open Data
Sample CRM data is available from HubSpot's developer documentation for testing their API integrations.
Synthetically Generated Customer Data
For customer behavior analysis practice when real data is unavailable, Faker (pip install Faker) and the sdv (Synthetic Data Vault) library generate realistic synthetic data.
from faker import Faker
import pandas as pd
import random

fake = Faker()
Faker.seed(42)
random.seed(42)

customers = []
for _ in range(1000):
    customers.append({
        "customer_id": fake.uuid4()[:8].upper(),
        "name": fake.company(),
        "email": fake.company_email(),
        "state": fake.state_abbr(),
        "industry": random.choice(["Technology", "Healthcare", "Finance", "Retail"]),
        "annual_revenue": round(random.lognormvariate(13, 1.5), -3),
        "employees": random.randint(10, 5000),
        "created_date": fake.date_between(start_date="-3y", end_date="today"),
    })

df = pd.DataFrame(customers)
Data Quality Warnings
When working with any of these sources, keep the following in mind:
Missing values are real, not mistakes. Government datasets often have missing values for good reasons: suppressed data to protect privacy, survey non-response, or genuine data unavailability. Document your missing value decisions.
Definitions change over time. The definition of "unemployment," "poverty," or "business" can change across survey years. When analyzing trends, verify that the definition is consistent across your date range.
Geography boundaries change. ZIP codes, county lines, and census tracts change between survey years. Be careful when joining geographic data across years.
Sample size matters for small geographies. ACS estimates for small counties or ZIP codes have large margins of error. The Census provides margin of error columns — include them in your analysis.
Kaggle dataset quality is inconsistent. Read the discussion section and check for issues before trusting any Kaggle dataset. Common problems: values that look fabricated, distributions that are clearly synthetic, and columns with unclear definitions.
Time zones and fiscal years vary. Transaction timestamps may be in different time zones. "Annual" financial data may use calendar year, fiscal year, or different 52-week period conventions across companies.
Currency and unit inconsistencies. International datasets may mix currencies without documentation. Check the data card or source documentation.
Always verify totals. After any data transformation, check that a known total (annual revenue, total records) matches what you expect. Small discrepancies often indicate a join or aggregation error.
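The total check described above is easy to automate. The sketch below shows a join that silently inflates revenue because the lookup table has a duplicate key, and how comparing against the known total exposes it. All table contents are made up for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": ["A", "B", "B"],
    "amount": [100.0, 50.0, 25.0],
})
# Customer "B" appears twice: a common cause of inflated totals.
customers = pd.DataFrame({
    "customer_id": ["A", "B", "B"],
    "state": ["NY", "CA", "CA"],
})

expected_total = orders["amount"].sum()

# Naive join: each order for "B" is matched twice.
naive = orders.merge(customers, on="customer_id", how="left")
inflated_total = naive["amount"].sum()
print(f"Expected {expected_total}, got {inflated_total} after naive join")

# Fix: de-duplicate the lookup table, and let pandas enforce that each
# order matches at most one customer row.
clean = orders.merge(
    customers.drop_duplicates("customer_id"),
    on="customer_id",
    how="left",
    validate="many_to_one",
)
assert clean["amount"].sum() == expected_total
assert len(clean) == len(orders)
```

The validate argument makes pandas raise an error if the right-hand table has duplicate keys, so the mistake surfaces at the join instead of in a downstream report.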