Appendix E: Data Sources Guide
"The best algorithm in the world is useless without the right data to feed it." — Prof. Diane Okonkwo, in-class remark during Chapter 4
One of the most common obstacles facing MBA students working on AI and machine learning projects is finding appropriate data. This appendix is your comprehensive directory. Whether you are completing a textbook exercise, building your capstone AI Transformation Plan (Chapter 39), or exploring a new industry application on your own, the resources below will help you locate, access, and work with high-quality datasets.
Every entry follows a consistent format:
- Name — The resource or dataset
- Source — Where to find it (described by platform and path rather than full URL, since web addresses change)
- Description — What it contains and why it matters
- Size — Approximate scale
- Format — File types and structures
- Access — Licensing, registration requirements, and cost
- Relevant Chapters — Where this dataset connects to textbook material
1. General-Purpose Dataset Platforms
These platforms aggregate datasets across industries and use cases. They are your first stop when searching for project data.
1.1 Kaggle Datasets
- Source: kaggle.com → Datasets tab
- Description: The world's largest data science community hosts over 200,000 public datasets spanning every industry and data type. Kaggle also hosts competitions with curated, well-documented datasets that include leaderboards and community notebooks demonstrating analysis approaches. The "Getting Started" competitions (Titanic, House Prices, Digit Recognizer) are ideal for first-time ML practitioners.
- Size: Ranges from a few KB to hundreds of GB per dataset
- Format: CSV, JSON, SQLite, images, text files; downloadable as ZIP archives
- Access: Free with Kaggle account; some competition datasets have usage restrictions
- Relevant Chapters: Virtually all chapters — especially Ch. 5 (EDA), Ch. 7 (classification), Ch. 8 (regression), Ch. 9 (clustering), Ch. 11 (model evaluation)
1.2 UCI Machine Learning Repository
- Source: archive.ics.uci.edu
- Description: One of the oldest and most widely cited sources in machine learning research, maintained by the University of California, Irvine. Contains over 600 datasets used in thousands of academic papers. Each dataset includes metadata about the number of instances, features, data types, and associated tasks (classification, regression, clustering). The repository is an excellent source for benchmark datasets where you want to compare your results against published literature.
- Size: Typically small to medium (hundreds to hundreds of thousands of rows)
- Format: CSV, ARFF, data/names file pairs
- Access: Free; no registration required for most datasets
- Relevant Chapters: Ch. 7 (classification), Ch. 8 (regression), Ch. 9 (unsupervised learning), Ch. 13 (neural networks)
1.3 Google Dataset Search
- Source: datasetsearch.research.google.com
- Description: A search engine specifically for datasets, indexing metadata from thousands of repositories worldwide. Think of it as "Google for data." It surfaces datasets from government portals, academic institutions, news organizations, and commercial providers. Particularly useful when you have a specific topic in mind but do not know which repository might host the data.
- Size: Varies by linked source
- Format: Varies by linked source
- Access: Free search; individual dataset access depends on the hosting institution
- Relevant Chapters: All chapters — useful for capstone research (Ch. 39) and industry application exploration (Ch. 36)
1.4 Data.gov (United States)
- Source: data.gov
- Description: The U.S. government's open data portal, containing over 300,000 datasets from federal agencies including the Census Bureau, Bureau of Labor Statistics, Department of Education, EPA, FDA, and more. An invaluable source for demographic data, economic indicators, health statistics, environmental measurements, and transportation records. Many of these datasets power the analytics behind public policy decisions.
- Size: Ranges from small reference tables to multi-gigabyte longitudinal surveys
- Format: CSV, XML, JSON, Shapefile (geospatial), API endpoints
- Access: Free; no registration required; U.S. Government Open Data License
- Relevant Chapters: Ch. 4 (data strategy), Ch. 8 (regression), Ch. 16 (time series), Ch. 36 (industry applications)
1.5 Data.gov.uk (United Kingdom)
- Source: data.gov.uk
- Description: The UK's open data portal, offering over 50,000 datasets from government departments, NHS, Transport for London, and local councils. Particularly strong in healthcare outcomes, transportation, education, and public finance. The data quality and documentation standards are generally excellent, reflecting the UK's early leadership in the open data movement.
- Size: Small to large
- Format: CSV, JSON, ODS, API endpoints
- Access: Free; Open Government Licence
- Relevant Chapters: Ch. 4 (data strategy), Ch. 28 (AI regulation — UK approach), Ch. 36 (industry applications)
1.6 AWS Open Data Registry
- Source: registry.opendata.aws
- Description: Amazon Web Services hosts a curated registry of large-scale public datasets available directly on AWS infrastructure. Includes satellite imagery (Landsat, Sentinel), genomic data (1000 Genomes Project), weather data (NOAA), and more. The key advantage is that these datasets are already stored in S3 buckets, so if you are working in an AWS environment, you can access them without downloading.
- Size: Often very large (terabytes); subset access is available
- Format: Parquet, CSV, GeoTIFF, FASTQ, and other domain-specific formats
- Access: Free to access; AWS compute costs apply if processing in the cloud
- Relevant Chapters: Ch. 15 (computer vision), Ch. 16 (time series), Ch. 23 (cloud AI services)
1.7 Hugging Face Datasets
- Source: huggingface.co → Datasets tab
- Description: A rapidly growing hub of over 100,000 datasets, with particular strength in NLP and generative AI tasks. Datasets are loadable in a single line of Python code using the `datasets` library. Community-contributed datasets include text classification, question answering, summarization, translation, dialogue, and multimodal tasks. Also hosts evaluation benchmarks like GLUE, SuperGLUE, and MMLU.
- Size: Varies; many are pre-split into train/validation/test sets
- Format: Arrow format (via the `datasets` library); exportable to CSV, JSON, Parquet
- Access: Free; some datasets require agreeing to terms of use
- Relevant Chapters: Ch. 14 (NLP), Ch. 17 (LLMs), Ch. 18 (multimodal), Ch. 19–20 (prompt engineering)
1.8 Eurostat
- Source: ec.europa.eu/eurostat
- Description: The statistical office of the European Union, providing high-quality data on the economy, population, trade, industry, agriculture, and environment for all EU member states. Offers extensive time series data going back decades, with standardized definitions across countries. Excellent for cross-country comparative analyses.
- Size: Medium to large; many datasets have tens of thousands of time-indexed observations
- Format: CSV, TSV, JSON-stat, SDMX; accessible via API
- Access: Free; no registration required
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 28 (EU AI regulation context)
1.9 World Bank Open Data
- Source: data.worldbank.org
- Description: Global development indicators covering 217 countries and spanning more than 60 years. Includes GDP, population, health expenditure, education enrollment, poverty rates, infrastructure metrics, and hundreds of other indicators. The data is clean, well-documented, and updated regularly. Ideal for international business analysis and macro-level forecasting.
- Size: Medium (thousands of country-year observations per indicator)
- Format: CSV, Excel, XML; API available
- Access: Free; Creative Commons Attribution 4.0 license
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 36 (industry applications)
1.10 Papers With Code Datasets
- Source: paperswithcode.com → Datasets section
- Description: Links academic papers to the datasets and code used to produce their results. When you find a state-of-the-art method for a particular task, this platform shows you the exact dataset on which it was benchmarked. Contains over 6,000 datasets across machine learning, NLP, computer vision, and more.
- Size: Varies
- Format: Varies
- Access: Free; individual dataset licenses vary
- Relevant Chapters: Ch. 11 (model evaluation — benchmarking), Ch. 13 (neural networks), Ch. 14 (NLP), Ch. 15 (computer vision)
1.11 FiveThirtyEight Data
- Source: github.com/fivethirtyeight/data
- Description: Datasets behind the data journalism published by FiveThirtyEight. Topics include politics, sports, economics, science, and culture. These datasets are typically clean, small enough to work with on a laptop, and accompanied by a published article that provides context and analytical framing. Excellent for practicing EDA and storytelling with data.
- Size: Small to medium (hundreds to tens of thousands of rows)
- Format: CSV
- Access: Free; available under various open licenses
- Relevant Chapters: Ch. 5 (EDA), Ch. 7 (classification), Ch. 8 (regression)
1.12 GitHub Awesome Public Datasets
- Source: github.com → search "awesome-public-datasets"
- Description: A community-curated list of high-quality public datasets organized by topic, including agriculture, biology, climate, economics, education, energy, government, healthcare, natural language, social networks, sports, and transportation. Not a data host itself, but a comprehensive index pointing to the best datasets across the internet.
- Size: Varies
- Format: Varies
- Access: Free index; individual dataset access varies
- Relevant Chapters: All chapters; especially useful for capstone topic exploration (Ch. 39)
1.13 Microsoft Research Open Data
- Source: msropendata.com
- Description: Curated datasets from Microsoft Research projects spanning NLP, computer vision, social science, and information retrieval. Includes datasets used in influential research papers and often comes with baseline code. Quality and documentation are consistently strong.
- Size: Medium to large
- Format: CSV, TSV, image files, specialized formats
- Access: Free; some datasets require agreeing to a research use license
- Relevant Chapters: Ch. 14 (NLP), Ch. 15 (computer vision), Ch. 37 (emerging technologies)
1.14 OpenML
- Source: openml.org
- Description: An open platform for sharing datasets, machine learning tasks, and experimental results. Contains thousands of datasets with standardized metadata, enabling automated benchmarking. Integrates with scikit-learn via the `openml` Python package, allowing you to load datasets directly into your modeling pipeline.
- Size: Mostly small to medium
- Format: ARFF, CSV; directly loadable via Python API
- Access: Free with account registration
- Relevant Chapters: Ch. 7 (classification), Ch. 8 (regression), Ch. 11 (model evaluation)
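The scikit-learn integration mentioned above also works in the other direction: scikit-learn can pull OpenML datasets by name via `fetch_openml`. A sketch, assuming scikit-learn is installed and a network connection is available:

```python
from sklearn.datasets import fetch_openml

# Fetch the classic iris dataset from OpenML; as_frame=True returns
# pandas objects rather than raw NumPy arrays.
iris = fetch_openml("iris", version=1, as_frame=True)

X = iris.data    # 150 rows, 4 feature columns
y = iris.target  # species labels
print(X.shape, y.nunique())
```

Pinning `version` explicitly is good practice, since OpenML may host multiple versions of the same dataset name.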
1.15 Datahub.io
- Source: datahub.io
- Description: A publishing platform for open data, maintained by Open Knowledge Foundation. Hosts curated "core datasets" covering economic indicators, geodata, reference tables, and other commonly needed data. These core datasets follow the Frictionless Data standard, ensuring consistent formatting and metadata.
- Size: Small to medium
- Format: CSV, JSON; Frictionless Data packages
- Access: Free; open licenses
- Relevant Chapters: Ch. 4 (data strategy), Ch. 5 (EDA)
2. Industry-Specific Datasets
2.1 Retail and E-Commerce
Instacart Market Basket Analysis
- Source: Kaggle → "Instacart Market Basket Analysis" competition
- Description: Over 3 million grocery orders from more than 200,000 Instacart users. Contains order sequences, product details, department and aisle information, and reorder flags. A rich dataset for market basket analysis, recommendation systems, and customer behavior modeling. This is the dataset Ravi Mehta references when discussing Athena Retail Group's early recommendation experiments.
- Size: ~3.4 million orders across 200,000+ users; approximately 550 MB
- Format: CSV (six relational tables)
- Access: Free with Kaggle account; Instacart competition rules apply
- Relevant Chapters: Ch. 9 (clustering), Ch. 10 (recommendation systems), Ch. 24 (marketing AI)
Online Retail Dataset (UCI)
- Source: UCI Machine Learning Repository → "Online Retail" or "Online Retail II"
- Description: Transactional data from a UK-based online retailer specializing in all-occasion gifts, covering December 2009 through December 2011. Contains invoice numbers, stock codes, descriptions, quantities, unit prices, customer IDs, and country codes. The classic dataset for RFM analysis, customer segmentation, and churn prediction.
- Size: ~541,000 transactions; approximately 45 MB
- Format: Excel (XLSX) or CSV
- Access: Free; no registration required
- Relevant Chapters: Ch. 5 (EDA), Ch. 7 (classification — churn), Ch. 9 (customer segmentation)
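The RFM analysis this dataset is known for reduces to a single groupby over customers. A minimal sketch on a toy frame using the dataset's actual column names (InvoiceNo, InvoiceDate, CustomerID, Quantity, UnitPrice); the values here are made up for illustration:

```python
import pandas as pd

# Toy transactions mimicking the Online Retail schema.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo":  ["A1", "A2", "B1", "B2", "B3", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-05", "2011-11-20",
        "2011-12-07", "2011-12-09", "2011-10-01"]),
    "Quantity":  [2, 1, 5, 3, 1, 10],
    "UnitPrice": [3.0, 10.0, 2.0, 4.0, 8.0, 1.5],
})
tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

# Recency is measured against a snapshot date just after the last transaction.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    frequency=("InvoiceNo", "nunique"),
    monetary=("Revenue", "sum"),
)
print(rfm)
```

From here, binning each column into quintiles gives the familiar 1–5 RFM scores used for segmentation.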
Amazon Product Reviews
- Source: Available via multiple mirrors; search "Amazon Product Reviews Dataset" or "Amazon Customer Reviews Dataset" on AWS Open Data
- Description: Tens of millions of product reviews spanning multiple product categories, with star ratings, review text, helpfulness votes, and product metadata. The sheer scale makes it ideal for NLP tasks including sentiment analysis, aspect-based opinion mining, and review summarization.
- Size: Ranges from millions to over 130 million reviews depending on the version
- Format: JSON, TSV, Parquet
- Access: Free; various open licenses depending on the specific version
- Relevant Chapters: Ch. 14 (NLP — `ReviewAnalyzer`), Ch. 17 (LLM summarization), Ch. 24 (customer experience)
Walmart Sales Forecasting
- Source: Kaggle → "Walmart Recruiting — Store Sales Forecasting"
- Description: Historical sales data for 45 Walmart stores, including department-level weekly sales, store size, type, temperature, fuel price, CPI, unemployment rate, and promotional markdown events. Designed for demand forecasting exercises with rich external features.
- Size: ~420,000 records across 45 stores and 99 departments
- Format: CSV
- Access: Free with Kaggle account
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series forecasting — Prophet workflow)
Olist Brazilian E-Commerce Dataset
- Source: Kaggle → "Brazilian E-Commerce Public Dataset by Olist"
- Description: Real commercial data from the Olist marketplace in Brazil, covering 100,000 orders placed between 2016 and 2018. Includes customer, seller, product, order, payment, review, and geolocation data across eight relational tables. An excellent dataset for practicing data integration and building end-to-end analytics pipelines.
- Size: ~100,000 orders; approximately 45 MB total
- Format: CSV (eight relational tables)
- Access: Free with Kaggle account; Creative Commons license
- Relevant Chapters: Ch. 5 (EDA), Ch. 7 (classification), Ch. 10 (recommendations), Ch. 24 (customer experience)
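The data-integration practice this dataset offers is mostly about joining its relational tables on shared keys. A toy sketch with pandas, using two miniature tables whose column names follow the Olist schema (the row values are invented):

```python
import pandas as pd

# Toy versions of two of the eight Olist tables.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c1"],
    "order_status": ["delivered", "delivered", "canceled"],
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "customer_state": ["SP", "RJ"],
})

# A left join keeps every order even if a customer record were missing.
merged = orders.merge(customers, on="customer_id", how="left")
orders_per_state = merged.groupby("customer_state")["order_id"].count()
print(orders_per_state)
```

Chaining a few such merges (orders → items → products → sellers) is exactly the end-to-end pipeline exercise the description refers to.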
Rossmann Store Sales
- Source: Kaggle → "Rossmann Store Sales"
- Description: Sales data for 1,115 Rossmann drugstores across Germany, including store type, assortment level, competition distance, and promotion flags. Ideal for regression and time series forecasting with rich categorical features and calendar effects.
- Size: ~1 million daily observations
- Format: CSV
- Access: Free with Kaggle account
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series forecasting)
H&M Personalized Fashion Recommendations
- Source: Kaggle → "H&M Personalized Fashion Recommendations"
- Description: Purchase history, customer metadata, and article (product) metadata with images for H&M fashion retail. Contains over 1.3 million customers and 100,000+ articles. Ideal for building hybrid recommendation systems that combine transactional data with image features.
- Size: ~31 million transactions; images add several GB
- Format: CSV, JPEG images
- Access: Free with Kaggle account; competition terms
- Relevant Chapters: Ch. 10 (recommendation systems), Ch. 15 (computer vision), Ch. 18 (multimodal)
RetailRocket E-Commerce Dataset
- Source: Kaggle → "RetailRocket Recommender System Dataset"
- Description: Behavioral data (views, add-to-cart, transactions) from a real e-commerce website over 4.5 months. Contains event timestamps, visitor IDs, item IDs, and item properties. Useful for session-based recommendation and conversion funnel analysis.
- Size: ~2.7 million events
- Format: CSV
- Access: Free with Kaggle account
- Relevant Chapters: Ch. 10 (recommendation systems), Ch. 24 (marketing AI)
2.2 Financial Services
Yahoo Finance (via yfinance Python Library)
- Source: Install via `pip install yfinance`; data sourced from Yahoo Finance
- Description: Historical and real-time stock prices, dividends, splits, options chains, and basic financial statements for publicly traded companies worldwide. The `yfinance` Python library provides a clean, programmatic interface. This is the simplest way to get financial time series data for classroom exercises.
- Size: Varies by ticker and date range; typically thousands of daily observations per stock
- Format: Returns pandas DataFrames directly in Python
- Access: Free; rate-limited; Yahoo's terms of service apply
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series forecasting), Ch. 34 (ROI measurement)
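Since the library returns pandas DataFrames, the usual first transformation for time series exercises is computing returns. A sketch on a synthetic close-price series standing in for a download (for real data you would fetch a ticker with `yfinance` instead, which requires the package and a network connection):

```python
import numpy as np
import pandas as pd

# Synthetic daily close prices standing in for a downloaded ticker.
dates = pd.date_range("2024-01-02", periods=5, freq="B")  # business days
close = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0], index=dates)

# Simple returns: percentage change day over day.
simple_returns = close.pct_change().dropna()

# Log returns: additive over time, often preferred for modeling.
log_returns = np.log(close / close.shift(1)).dropna()
print(simple_returns.round(4).tolist())
```

Returns, not raw prices, are what most forecasting and risk models expect as input, since prices are non-stationary.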
Federal Reserve Economic Data (FRED)
- Source: fred.stlouisfed.org; Python access via the `fredapi` library
- Description: Over 800,000 economic time series from 107 sources, maintained by the Federal Reserve Bank of St. Louis. Includes GDP, inflation, unemployment, interest rates, housing starts, consumer confidence, and hundreds of other macroeconomic indicators. The gold standard for U.S. economic data.
- Size: Hundreds to thousands of observations per series (monthly, quarterly, or annual)
- Format: CSV, Excel, JSON; API available (requires free API key)
- Access: Free; requires API key registration for programmatic access
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 34 (ROI — economic context)
SEC EDGAR Filings
- Source: sec.gov/edgar; full-text search at efts.sec.gov/LATEST/search-index
- Description: All public company filings with the U.S. Securities and Exchange Commission, including 10-K annual reports, 10-Q quarterly reports, 8-K current reports, proxy statements, and more. An invaluable source for NLP projects — extracting risk factors, analyzing management discussion sections, or building financial sentiment models.
- Size: Millions of filings spanning decades; individual filings range from a few KB to several MB
- Format: HTML, XBRL, plain text; API available
- Access: Free; no registration required; rate limits apply
- Relevant Chapters: Ch. 14 (NLP — text analysis), Ch. 17 (LLM — document summarization), Ch. 36 (finance applications)
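The risk-factor extraction mentioned above usually starts by slicing the "Item 1A" section out of the filing text. A toy sketch using only the standard library, assuming the filing has already been downloaded and converted to plain text; real 10-Ks are HTML and far less regular, so treat this as the idea rather than a production parser:

```python
import re

# Toy 10-K text; real filings are much longer and messier.
filing = """
Item 1. Business. We make widgets.
Item 1A. Risk Factors. Demand may fall. Competition may intensify.
Item 1B. Unresolved Staff Comments. None.
"""

# Slice the text between the "Item 1A" heading and the next "Item 1B" heading.
match = re.search(r"Item 1A\.(.*?)Item 1B\.", filing, re.DOTALL | re.IGNORECASE)
risk_factors = match.group(1).strip()
print(risk_factors)
```

Once isolated, the section text can be fed to sentiment models, topic models, or an LLM summarizer.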
Lending Club Loan Data
- Source: Kaggle → search "Lending Club"; historical data archives
- Description: Detailed loan origination and performance data from the peer-to-peer lending platform. Includes loan amount, interest rate, employment information, FICO score ranges, delinquency history, and loan status (current, default, charged off). A canonical dataset for credit risk modeling and binary classification.
- Size: Over 2 million loans; approximately 1.5 GB
- Format: CSV
- Access: Free via Kaggle; historical data archives available
- Relevant Chapters: Ch. 7 (classification — default prediction), Ch. 25 (bias detection — lending disparities), Ch. 26 (explainability)
Credit Card Fraud Detection Dataset
- Source: Kaggle → "Credit Card Fraud Detection"
- Description: Anonymized credit card transactions from September 2013, containing 492 frauds out of 284,807 transactions. Features V1–V28 are PCA-transformed for confidentiality; only time and amount are unmasked. The canonical dataset for learning about class imbalance, precision-recall tradeoffs, and anomaly detection.
- Size: 284,807 transactions; approximately 150 MB
- Format: CSV
- Access: Free with Kaggle account; Open Database License
- Relevant Chapters: Ch. 7 (classification — imbalanced classes), Ch. 11 (model evaluation — precision vs. recall), Ch. 29 (security)
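With 492 frauds in 284,807 transactions, a model that predicts "not fraud" for everything is 99.8% accurate and completely useless, which is why this dataset forces you to think in precision and recall per decision threshold. A self-contained sketch of that tradeoff on toy scores (all values invented):

```python
# Toy labels (1 = fraud) and model scores for an imbalanced problem.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.15, 0.05, 0.3, 0.4, 0.1, 0.2, 0.9, 0.35]

def precision_recall(y_true, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# High threshold: precise but misses one fraud.
print(precision_recall(y_true, scores, 0.5))   # (1.0, 0.5)
# Lower threshold: catches both frauds at the cost of a false alarm.
print(precision_recall(y_true, scores, 0.33))
```

The business question — how many false alarms is one caught fraud worth? — is what picks the threshold, not the algorithm.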
European Central Bank Exchange Rates
- Source: ecb.europa.eu → Statistical Data Warehouse
- Description: Reference exchange rates for major world currencies against the euro, updated every business day. Clean, reliable time series data ideal for introductory forecasting exercises.
- Size: Thousands of daily observations per currency pair
- Format: CSV, XML; API (SDMX)
- Access: Free; no registration
- Relevant Chapters: Ch. 16 (time series), Ch. 36 (finance applications)
Quandl / Nasdaq Data Link
- Source: data.nasdaq.com (formerly Quandl)
- Description: A wide-ranging financial and economic data platform aggregating data from hundreds of sources including central banks, exchanges, and alternative data providers. The free tier covers core financial datasets; premium tiers offer alternative data (satellite imagery, shipping, employment).
- Size: Varies by dataset
- Format: CSV, JSON, XML; Python library available
- Access: Free tier with API key; premium datasets require subscription
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 23 (APIs)
German Credit Data (UCI)
- Source: UCI Machine Learning Repository → "Statlog (German Credit Data)"
- Description: 1,000 loan applicants classified as good or bad credit risk, with 20 attributes including credit history, employment, housing, and existing accounts. A classic dataset for fairness-aware machine learning, as it includes protected attributes (age, gender, foreign worker status) that enable bias auditing.
- Size: 1,000 instances; 20 attributes
- Format: CSV / data file
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification), Ch. 25 (bias detection — `BiasDetector`), Ch. 26 (fairness and explainability)
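The simplest bias audit this dataset supports is comparing favorable-outcome rates across a protected attribute (the demographic parity difference). A toy sketch with hypothetical group labels and decisions; a real audit on German Credit would group by age, gender, or foreign-worker status:

```python
import pandas as pd

# Toy approval decisions with a protected attribute.
df = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   1,   0,   1,   0,   0,   0],
})

# Approval rate per group; the gap is the demographic parity difference.
rates = df.groupby("group")["approved"].mean()
parity_gap = rates["A"] - rates["B"]
print(rates.to_dict(), f"gap = {parity_gap:.2f}")
```

A large gap is a flag to investigate, not proof of unfairness on its own — base rates and other fairness criteria (covered in Ch. 26) matter too.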
2.3 Healthcare
MIMIC-III (Medical Information Mart for Intensive Care)
- Source: physionet.org → MIMIC-III Clinical Database
- Description: A large, freely available database of de-identified health records from over 40,000 patients who stayed in critical care units at Beth Israel Deaconess Medical Center between 2001 and 2012. Contains vital signs, laboratory results, medications, caregiver notes, procedures, diagnostic codes, and mortality outcomes. One of the most important open datasets in healthcare AI.
- Size: ~40,000 patients; 26 relational tables; approximately 6 GB compressed
- Format: CSV (relational tables); PostgreSQL database available
- Access: Free after completing a data ethics course (CITI training) and signing a data use agreement via PhysioNet
- Relevant Chapters: Ch. 7 (classification — readmission prediction), Ch. 14 (NLP — clinical notes), Ch. 29 (privacy — de-identification), Ch. 36 (healthcare AI)
WHO Global Health Observatory (GHO)
- Source: who.int → GHO Data Repository; programmatic access via the Athena API
- Description: Health statistics for 194 WHO member states, covering mortality, disease burden, health systems, environmental health, nutrition, and the Sustainable Development Goals. Time series data spanning decades for many indicators.
- Size: Thousands of indicator series across 194 countries
- Format: CSV, JSON; API available
- Access: Free; no registration required
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 36 (healthcare applications)
CMS Medicare Provider Utilization and Payment Data
- Source: data.cms.gov
- Description: Detailed information on services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals, including utilization, charges, and payments. Enables analysis of healthcare spending patterns, provider variation, and potential fraud detection.
- Size: Millions of records across multiple years and data types
- Format: CSV
- Access: Free; no registration required
- Relevant Chapters: Ch. 7 (classification — fraud detection), Ch. 9 (clustering — provider segmentation), Ch. 36 (healthcare applications)
NIH National Library of Medicine (PubMed / PMC)
- Source: pubmed.ncbi.nlm.nih.gov; bulk download via FTP
- Description: Over 36 million citations and abstracts for biomedical literature from MEDLINE, life science journals, and online books. PubMed Central (PMC) provides full-text articles. An enormous corpus for biomedical NLP, citation analysis, and knowledge graph construction.
- Size: 36+ million abstracts; millions of full-text articles
- Format: XML, plain text; E-utilities API available
- Access: Free; API key recommended for bulk access
- Relevant Chapters: Ch. 14 (NLP — biomedical text mining), Ch. 17 (LLM applications)
Heart Disease Dataset (UCI / Cleveland)
- Source: UCI Machine Learning Repository → "Heart Disease"; also on Kaggle
- Description: Clinical attributes (age, sex, chest pain type, resting blood pressure, cholesterol, fasting blood sugar, ECG results, maximum heart rate, exercise-induced angina) for 303 patients, with a target variable indicating the presence of heart disease. Compact, clean, and widely used for introductory classification exercises.
- Size: 303 instances; 14 attributes
- Format: CSV
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification), Ch. 13 (neural networks), Ch. 26 (explainability)
COVID-19 Open Datasets
- Source: Multiple sources — Johns Hopkins CSSE (github.com/CSSEGISandData/COVID-19), Our World in Data (ourworldindata.org), WHO COVID-19 Dashboard data
- Description: Daily case counts, deaths, vaccinations, testing rates, and policy responses for countries and sub-national regions worldwide. One of the most intensively analyzed datasets in history, with extensive community analysis and modeling available for reference.
- Size: Hundreds of thousands of daily observations across 200+ countries
- Format: CSV, JSON
- Access: Free; various open licenses
- Relevant Chapters: Ch. 16 (time series), Ch. 30 (responsible AI — pandemic response), Ch. 36 (healthcare)
Chest X-Ray Images (Pneumonia Detection)
- Source: Kaggle → "Chest X-Ray Images (Pneumonia)"
- Description: 5,863 chest X-ray images labeled as normal or pneumonia (bacterial or viral). Organized into train, validation, and test directories. A popular introductory dataset for medical image classification with convolutional neural networks.
- Size: 5,863 images; approximately 1.2 GB
- Format: JPEG images in directory structure
- Access: Free with Kaggle account
- Relevant Chapters: Ch. 15 (computer vision), Ch. 36 (healthcare AI)
Drug Review Dataset
- Source: UCI Machine Learning Repository → "Drug Review Dataset (Drugs.com)"
- Description: Over 200,000 patient drug reviews scraped from Drugs.com, including the drug name, condition being treated, the review text, a 10-point rating, and the date. Useful for sentiment analysis, topic modeling, and understanding patient experience.
- Size: ~215,000 reviews
- Format: TSV
- Access: Free; no registration
- Relevant Chapters: Ch. 14 (NLP — sentiment analysis), Ch. 24 (customer experience)
2.4 Manufacturing and Industrial
NASA Turbofan Engine Degradation Simulation (C-MAPSS)
- Source: NASA Prognostics Center of Excellence → data repository; also available on Kaggle
- Description: Simulated run-to-failure data for turbofan jet engines under different operating conditions and fault modes. Contains multivariate time series from 21 sensors, with the goal of predicting Remaining Useful Life (RUL). The canonical dataset for predictive maintenance research.
- Size: Four sub-datasets (FD001–FD004) with 100 to 249 engines each
- Format: Text files (space-delimited)
- Access: Free; public domain
- Relevant Chapters: Ch. 8 (regression — RUL prediction), Ch. 16 (time series), Ch. 36 (manufacturing applications)
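C-MAPSS ships sensor readings without an explicit RUL column; since each training engine runs to failure, the standard label is the engine's last observed cycle minus the current cycle. A sketch of that labeling step with pandas on toy data (two engines, a few cycles each):

```python
import pandas as pd

# Toy run-to-failure log: two engines observed until failure.
df = pd.DataFrame({
    "unit":  [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
})

# RUL at each row = last observed cycle for that engine minus current cycle.
df["RUL"] = df.groupby("unit")["cycle"].transform("max") - df["cycle"]
print(df)
```

With the label constructed, RUL prediction becomes an ordinary regression problem on the 21 sensor columns.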
Steel Plates Faults Dataset
- Source: UCI Machine Learning Repository → "Steel Plates Faults"
- Description: 1,941 instances of steel plate faults classified into seven categories (pastry, Z-scratch, K-scratch, stains, dirtiness, bumps, other faults). Each instance has 27 numeric features describing the fault geometry and steel properties. A clean multi-class classification problem relevant to quality control in manufacturing.
- Size: 1,941 instances; 27 features; 7 classes
- Format: CSV
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification — multi-class), Ch. 36 (manufacturing)
SECOM Semiconductor Manufacturing Dataset
- Source: UCI Machine Learning Repository → "SECOM"
- Description: 1,567 observations from a semiconductor manufacturing process, with 590 sensor measurements and a binary pass/fail target. Characterized by high dimensionality, missing values, and severe class imbalance — all of which mirror real-world manufacturing data challenges.
- Size: 1,567 instances; 590 features
- Format: CSV / space-delimited text
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification — imbalanced classes), Ch. 5 (EDA — handling missing data), Ch. 36 (manufacturing)
Predictive Maintenance Dataset (Microsoft)
- Source: Azure AI Gallery / Kaggle → "Predictive Maintenance" or "AI4I 2020 Predictive Maintenance"
- Description: Synthetic but realistic dataset simulating machine failures in an industrial setting. The AI4I version contains 10,000 data points with features like air temperature, process temperature, rotational speed, torque, and tool wear, plus failure mode labels (heat dissipation, power, overstrain, tool wear, random). Designed for teaching predictive maintenance without real industrial data access constraints.
- Size: 10,000 instances; 14 features
- Format: CSV
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification), Ch. 36 (manufacturing applications)
Tennessee Eastman Process Simulation
- Source: Available via academic repositories and GitHub; search "Tennessee Eastman Process dataset"
- Description: Simulated data from a chemical process model developed by Eastman Chemical Company. Contains 52 process variables under normal operation and 21 different fault conditions. Widely used in process monitoring, fault detection, and industrial AI research.
- Size: Thousands of time-indexed observations per simulation run
- Format: DAT, CSV (depending on version)
- Access: Free; academic use
- Relevant Chapters: Ch. 9 (anomaly detection), Ch. 16 (time series), Ch. 36 (manufacturing)
Bosch Production Line Performance
- Source: Kaggle → "Bosch Production Line Performance"
- Description: Anonymous measurements from Bosch's manufacturing process, with the task of predicting internal failures. Features over 4,000 numeric, categorical, and timestamp columns across three files. Extremely high-dimensional and sparse, representing a realistic large-scale manufacturing data challenge.
- Size: ~1.2 million parts; 4,000+ features; approximately 13 GB
- Format: CSV
- Access: Free with Kaggle account; competition terms
- Relevant Chapters: Ch. 7 (classification), Ch. 5 (EDA — high-dimensional data), Ch. 36 (manufacturing)
2.5 NLP and Text Data
Common Crawl
- Source: commoncrawl.org
- Description: A massive, open corpus of web crawl data collected since 2008. Each monthly crawl contains billions of web pages, totaling petabytes of data. The raw material from which many large language models are trained. For classroom use, the pre-processed WET (text-only) files are most practical; use the index to download subsets rather than the full corpus.
- Size: Petabytes in total; individual monthly crawls are tens of terabytes
- Format: WARC, WAT, WET (compressed text)
- Access: Free; stored on AWS Open Data
- Relevant Chapters: Ch. 14 (NLP), Ch. 17 (LLMs — training data), Ch. 25 (bias — web-sourced biases)
Wikipedia Dumps
- Source: dumps.wikimedia.org
- Description: Complete database dumps of all Wikimedia projects, including the full text of Wikipedia in all languages. The English Wikipedia dump contains millions of articles and is one of the most commonly used corpora for NLP research, knowledge base construction, and entity linking.
- Size: English Wikipedia: ~20 GB compressed (text only); full with metadata: 80+ GB
- Format: XML, SQL dumps; community tools exist for conversion to plain text or JSON
- Access: Free; Creative Commons Attribution-ShareAlike license
- Relevant Chapters: Ch. 14 (NLP — text processing), Ch. 17 (LLMs), Ch. 21 (RAG pipeline — knowledge base)
Yelp Open Dataset
- Source: yelp.com/dataset
- Description: A subset of Yelp's business, review, and user data, containing over 6.9 million reviews, 150,000 businesses, and 200,000 photos across 11 metropolitan areas. Includes review text, star ratings, business attributes, check-in data, and user social network information. Rich enough for sentiment analysis, recommendation systems, graph analysis, and photo classification.
- Size: ~6.9 million reviews; approximately 10 GB total
- Format: JSON
- Access: Free after agreeing to Yelp Dataset License; restricted to academic and educational use
- Relevant Chapters: Ch. 14 (NLP — `ReviewAnalyzer`), Ch. 10 (recommendations), Ch. 9 (clustering)
IMDB Movie Reviews
- Source: Available via Hugging Face Datasets, Kaggle, or Stanford AI Lab
- Description: 50,000 movie reviews split evenly between positive and negative sentiment. One of the most popular benchmark datasets for binary sentiment classification. Clean, well-balanced, and extensively benchmarked across hundreds of published models.
- Size: 50,000 reviews; approximately 80 MB
- Format: Text files or CSV; also available via the Hugging Face `datasets` library
- Access: Free; no registration required
- Relevant Chapters: Ch. 14 (NLP — sentiment classification), Ch. 13 (neural networks — text classification)
SQuAD (Stanford Question Answering Dataset)
- Source: rajpurkar.github.io/SQuAD-explorer; also on Hugging Face
- Description: A reading comprehension dataset consisting of questions posed on Wikipedia articles, where the answer to each question is a segment of the corresponding passage. SQuAD 2.0 adds unanswerable questions, requiring models to know when they do not know. A key benchmark for evaluating LLM comprehension capabilities.
- Size: 100,000+ question-answer pairs (SQuAD 1.1); 150,000+ including unanswerable (SQuAD 2.0)
- Format: JSON
- Access: Free; CC BY-SA 4.0
- Relevant Chapters: Ch. 14 (NLP), Ch. 17 (LLMs — evaluation), Ch. 21 (RAG pipeline)
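SQuAD's JSON nests questions inside paragraphs inside articles, so most workflows start by flattening it to one row per question. A sketch using a schematic two-question sample; the structure follows the published format, but the content here is invented purely for illustration:

```python
import json

# Schematic SQuAD 2.0-style record (structure only; not real dataset content)
sample = json.loads("""
{"data": [{"title": "Example", "paragraphs": [{
  "context": "Paris is the capital of France.",
  "qas": [
    {"id": "q1", "question": "What is the capital of France?", "is_impossible": false,
     "answers": [{"text": "Paris", "answer_start": 0}]},
    {"id": "q2", "question": "What is the capital of Spain?", "is_impossible": true,
     "answers": []}
  ]}]}]}
""")

def flatten_squad(squad: dict) -> list[dict]:
    """Flatten nested SQuAD JSON into one row per question."""
    rows = []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                rows.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": para["context"],
                    "answer": qa["answers"][0]["text"] if qa["answers"] else None,
                    "is_impossible": qa.get("is_impossible", False),
                })
    return rows

rows = flatten_squad(sample)
print(len(rows))  # 2
```

The `is_impossible` flag is what distinguishes SQuAD 2.0: a model is rewarded for abstaining on questions the passage cannot answer.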
GLUE and SuperGLUE Benchmarks
- Source: gluebenchmark.com / super.gluebenchmark.com; available via Hugging Face
- Description: Collections of diverse NLP tasks designed to evaluate and compare language understanding models. GLUE includes sentiment analysis (SST-2), textual entailment (MNLI, RTE), paraphrase detection (QQP, MRPC), and more. SuperGLUE adds harder tasks like reading comprehension with commonsense reasoning (ReCoRD), word sense disambiguation (WiC), and causal reasoning (COPA).
- Size: Varies by task; typically thousands to hundreds of thousands of examples
- Format: TSV, JSON; loadable via the Hugging Face `datasets` library
- Access: Free
- Relevant Chapters: Ch. 14 (NLP — benchmarking), Ch. 17 (LLMs — evaluation), Ch. 11 (model evaluation)
AG News Classification Dataset
- Source: Kaggle; also on Hugging Face
- Description: 127,600 news articles categorized into four classes: World, Sports, Business, and Science/Technology. Constructed from the AG corpus of news articles collected by ComeToMyHead academic search engine. A clean, balanced multi-class text classification dataset ideal for introductory NLP exercises.
- Size: 127,600 articles (120,000 train / 7,600 test)
- Format: CSV
- Access: Free
- Relevant Chapters: Ch. 14 (NLP — text classification), Ch. 7 (classification)
20 Newsgroups
- Source: scikit-learn built-in (`sklearn.datasets.fetch_20newsgroups`); also on various mirrors
- Description: A collection of approximately 20,000 newsgroup documents partitioned across 20 different newsgroup categories. One of the original benchmark datasets for text classification and topic modeling. Loadable directly from scikit-learn with a single function call.
- Size: ~20,000 documents; 20 categories
- Format: Text; directly loadable in scikit-learn
- Access: Free; public domain
- Relevant Chapters: Ch. 9 (clustering — topic modeling), Ch. 14 (NLP — text classification)
2.6 Computer Vision
ImageNet (ILSVRC)
- Source: image-net.org
- Description: Over 14 million hand-annotated images organized according to the WordNet hierarchy. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) subset, with 1,000 categories and 1.2 million training images, is the benchmark that catalyzed the deep learning revolution in 2012 when AlexNet won the competition. Pre-trained ImageNet weights are available in every major deep learning framework.
- Size: Full: 14+ million images; ILSVRC: ~1.2 million images; approximately 150 GB
- Format: JPEG images with XML annotations
- Access: Free for research and educational use; requires registration
- Relevant Chapters: Ch. 13 (neural networks — transfer learning), Ch. 15 (computer vision)
COCO (Common Objects in Context)
- Source: cocodataset.org
- Description: Over 330,000 images with 80 object categories, 91 stuff categories, and 250,000 people with keypoints. Annotations include bounding boxes, segmentation masks, and captions. The standard benchmark for object detection, instance segmentation, and image captioning.
- Size: ~330,000 images; approximately 25 GB
- Format: JPEG images; JSON annotations
- Access: Free; Creative Commons Attribution 4.0
- Relevant Chapters: Ch. 15 (computer vision — object detection), Ch. 18 (multimodal — image captioning)
Open Images Dataset (Google)
- Source: storage.googleapis.com/openimages
- Description: A dataset of approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. Covers 600 object classes with over 16 million bounding boxes. More diverse and larger than COCO, though annotations can be noisier.
- Size: ~9 million images; hundreds of GB
- Format: JPEG images; CSV annotations
- Access: Free; Creative Commons Attribution 4.0
- Relevant Chapters: Ch. 15 (computer vision), Ch. 18 (multimodal)
CIFAR-10 and CIFAR-100
- Source: cs.toronto.edu → CIFAR datasets; built into most deep learning frameworks
- Description: CIFAR-10 contains 60,000 32×32 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). CIFAR-100 has the same image count but spans 100 fine-grained classes grouped into 20 superclasses. Small enough to train on a laptop CPU/GPU, making them ideal for rapid prototyping and experimentation.
- Size: 60,000 images each; approximately 170 MB per dataset
- Format: Binary pickle files; also available as PNG images; built into PyTorch and TensorFlow
- Access: Free; no registration
- Relevant Chapters: Ch. 13 (neural networks — CNNs), Ch. 15 (computer vision)
MNIST and Fashion-MNIST
- Source: yann.lecun.com/exdb/mnist (MNIST); github.com/zalandoresearch/fashion-mnist (Fashion-MNIST)
- Description: MNIST is the "hello world" of machine learning: 70,000 grayscale 28×28 images of handwritten digits (0–9). Fashion-MNIST is a drop-in replacement with 10 categories of clothing items (t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, ankle boot), designed to be slightly more challenging. Both are built into virtually every ML framework.
- Size: 70,000 images each; approximately 11 MB (MNIST), 30 MB (Fashion-MNIST)
- Format: IDX format; directly loadable in scikit-learn, PyTorch, TensorFlow
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification intro), Ch. 13 (neural networks)
CelebA (CelebFaces Attributes Dataset)
- Source: mmlab.ie.cuhk.edu.hk → CelebA; also on Kaggle
- Description: Over 200,000 celebrity face images annotated with 40 binary attributes (smiling, wearing glasses, male, etc.) and 5 landmark locations. Widely used for facial attribute prediction, face generation, and bias analysis in facial recognition systems.
- Size: 202,599 images; approximately 1.4 GB
- Format: JPEG images; CSV annotations
- Access: Free for non-commercial research
- Relevant Chapters: Ch. 15 (computer vision), Ch. 25 (bias — facial recognition disparities)
Stanford Cars / Oxford Pets / Flowers
- Source: Various Stanford and Oxford research group pages; also on Kaggle and TensorFlow Datasets
- Description: Fine-grained visual classification datasets covering 196 car models, 37 pet breeds, and 102 flower species respectively. Designed for transfer learning experiments where distinguishing between visually similar subcategories is the challenge.
- Size: 8,000–16,000 images each
- Format: JPEG images with labels
- Access: Free for research
- Relevant Chapters: Ch. 15 (computer vision — transfer learning), Ch. 13 (neural networks)
3. APIs for Real-Time Data
Real-time data enables you to build applications that respond to current conditions rather than analyzing historical snapshots. This section covers APIs that provide programmatic access to live or frequently updated data.
Note on API keys: Most APIs below require a free registration and API key. Store your keys in environment variables or a `.env` file — never hardcode them in scripts or commit them to version control. See Chapter 23 for API integration patterns.
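A minimal pattern for that advice, using only the standard library; the variable name `OPENWEATHER_API_KEY` is just an example:

```python
import os

def get_api_key(var_name: str = "OPENWEATHER_API_KEY") -> str:
    """Fetch an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set {var_name} in your shell or .env file before running.")
    return key
```

If you keep keys in a `.env` file, the `python-dotenv` package's `load_dotenv()` will populate `os.environ` from it before this lookup runs.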
3.1 Financial Data APIs
Alpha Vantage
- Source: alphavantage.co; Python wrapper: `alpha_vantage` package
- Description: Free API providing real-time and historical stock prices, forex rates, cryptocurrency prices, and over 50 technical indicators. The free tier allows 25 requests per day. A cleaner alternative to scraping financial websites.
- Access: Free tier with API key; premium plans available
- Relevant Chapters: Ch. 16 (time series), Ch. 23 (cloud AI services and APIs)
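As a sketch of what a call looks like, the helper below (our own name) assembles a daily-prices request URL against the documented query endpoint; the commented-out fetch assumes the `requests` package and a valid key:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"

def daily_prices_url(symbol: str, api_key: str) -> str:
    """Assemble a TIME_SERIES_DAILY request URL for the given ticker."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return f"{BASE_URL}?{urlencode(params)}"

url = daily_prices_url("IBM", "demo")

# To actually fetch (network call; assumes the `requests` package):
# import requests
# data = requests.get(url, timeout=10).json()
```

Keeping the URL construction separate from the network call makes the request logic easy to unit-test without hitting the rate-limited API.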
Polygon.io
- Source: polygon.io; Python client available
- Description: Financial market data API covering stocks, options, forex, and crypto. Provides real-time and historical data including trades, quotes, bars, and reference data. More robust rate limits than free alternatives.
- Access: Free basic tier; paid tiers for real-time data
- Relevant Chapters: Ch. 16 (time series), Ch. 23 (APIs)
3.2 Social Media and Web APIs
Reddit API (via PRAW)
- Source: reddit.com/dev/api; Python wrapper: `praw` package
- Description: Access Reddit posts, comments, user data, and subreddit metadata. Useful for sentiment analysis, trend detection, and social network analysis. The `praw` (Python Reddit API Wrapper) library simplifies authentication and pagination.
- Access: Free with Reddit app registration; rate-limited
- Relevant Chapters: Ch. 14 (NLP — sentiment analysis), Ch. 24 (marketing — social listening)
X (Twitter) API
- Source: developer.x.com
- Description: Access tweets, user profiles, followers, and trending topics. The API has undergone significant changes in access levels and pricing since 2023. The free tier provides limited write access and basic search; the basic paid tier provides additional read access. Important for understanding real-time public discourse, but plan for access constraints.
- Access: Free tier with limited access; paid tiers for broader access
- Relevant Chapters: Ch. 14 (NLP), Ch. 24 (marketing AI), Ch. 16 (time series — trend detection)
NewsAPI
- Source: newsapi.org; Python client available
- Description: Aggregates headlines and articles from over 150,000 news sources and blogs worldwide. Searchable by keyword, source, language, and date range. The free tier is limited to development use (100 requests/day, articles up to 1 month old); production use requires a paid plan.
- Access: Free developer tier; paid for production
- Relevant Chapters: Ch. 14 (NLP — news classification), Ch. 17 (LLMs — summarization)
3.3 Weather and Environmental APIs
OpenWeatherMap API
- Source: openweathermap.org/api; Python wrapper available
- Description: Current weather, 5-day forecasts, historical data, and weather alerts for any location worldwide. Supports queries by city name, coordinates, or ZIP code. The free tier provides current weather and 5-day forecasts.
- Access: Free tier (60 calls/minute); paid tiers for historical data and advanced features
- Relevant Chapters: Ch. 8 (regression — weather as feature), Ch. 16 (time series), Ch. 23 (APIs)
NOAA Climate Data Online
- Source: ncdc.noaa.gov; Climate Data Online API
- Description: Historical weather and climate data from NOAA's extensive observation network, including daily summaries, monthly normals, and extreme weather events. Covers stations worldwide with records going back over a century.
- Access: Free with API token
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series)
3.4 Geospatial APIs
OpenStreetMap (Overpass API)
- Source: openstreetmap.org; Overpass API at overpass-api.de
- Description: Volunteer-contributed global mapping data including roads, buildings, businesses, parks, transit stops, and more. The Overpass API allows complex spatial queries. The Python libraries `osmnx` and `geopy` provide convenient access.
- Access: Free; Open Database License
- Relevant Chapters: Ch. 36 (industry applications — logistics, real estate)
Google Maps Platform APIs
- Source: developers.google.com/maps
- Description: Geocoding, directions, places, distance matrix, and static/dynamic maps. Essential for location-based analytics — calculating drive times, finding nearby competitors, and geocoding addresses.
- Access: Free tier ($200/month credit); pay-as-you-go beyond that
- Relevant Chapters: Ch. 23 (cloud AI services), Ch. 36 (retail and logistics)
3.5 Government and Open Data APIs
Census Bureau API
- Source: api.census.gov
- Description: Programmatic access to U.S. Census data, including the American Community Survey, Decennial Census, Economic Census, and Population Estimates. Supports detailed geographic queries down to the census tract level. The Python wrapper `cenpy` simplifies access.
- Access: Free with API key
- Relevant Chapters: Ch. 8 (regression — demographic features), Ch. 9 (clustering — geographic segmentation), Ch. 36 (public sector)
Bureau of Labor Statistics (BLS) API
- Source: bls.gov/developers
- Description: Employment, unemployment, consumer prices, producer prices, wages, and productivity data for the United States. Time series data with monthly, quarterly, and annual frequency.
- Access: Free; API key recommended for higher rate limits
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series)
3.6 AI and Machine Learning APIs
OpenAI API
- Source: platform.openai.com
- Description: Access to GPT-4, GPT-3.5, DALL-E, Whisper, and embedding models. The primary API used in Chapters 17, 19, and 20 for LLM integration, prompt engineering, and AI-powered workflows. Supports chat completions, function calling, vision, and fine-tuning.
- Access: Pay-per-use; free trial credits available for new accounts
- Relevant Chapters: Ch. 17 (LLMs), Ch. 19–20 (prompt engineering), Ch. 21 (RAG pipeline), Ch. 23 (cloud AI services)
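As a sketch of the request shape, the helper below (our own name, with an example model string) builds a chat-completions body; the commented-out lines show how the official `openai` client would send it, assuming an `OPENAI_API_KEY` in the environment:

```python
def summarization_payload(document: str, model: str = "gpt-4o-mini") -> dict:
    """Build a chat-completions request body (model name is an example choice)."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Summarize the document in three bullet points for an executive reader."},
            {"role": "user", "content": document},
        ],
    }

payload = summarization_payload("Q3 revenue rose 12% on strong holiday demand.")

# Sending it requires the openai package and an OPENAI_API_KEY environment variable:
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(**payload)
# print(reply.choices[0].message.content)
```

Building the payload in a plain function keeps prompt templates versionable and testable independently of the (billed) API call.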
Hugging Face Inference API
- Source: huggingface.co → Inference API
- Description: Run inference on thousands of open-source models hosted on Hugging Face, including text generation, classification, summarization, translation, image classification, and more. Free tier available for experimentation; rate-limited.
- Access: Free tier; paid Inference Endpoints for production
- Relevant Chapters: Ch. 14 (NLP), Ch. 17 (LLMs — open-source alternatives), Ch. 23 (APIs)
4. Synthetic Data Tools
When real-world data is unavailable, too sensitive, too expensive, or insufficiently diverse, synthetic data can fill the gap. This section covers tools for generating realistic fake data and when to use them.
4.1 When to Use Synthetic Data
Synthetic data is appropriate in the following scenarios:
- Privacy constraints — You need data that mimics real patient, customer, or financial records without exposing actual individuals (Ch. 29)
- Class imbalance — You need more examples of a rare class (fraud, equipment failure) to train a balanced model (Ch. 7, Ch. 11)
- Prototyping — You want to build and test a pipeline before real data is available (Ch. 12, Ch. 33)
- Augmentation — You want to expand a small training set while preserving statistical properties (Ch. 13, Ch. 15)
- Testing and QA — You need diverse test cases to validate data pipelines, dashboards, or applications (Ch. 12)
- Bias mitigation — You want to generate counterfactual examples to test fairness (Ch. 25, Ch. 26)
Synthetic data is not appropriate when:
- Regulatory requirements demand auditable, real-world evidence
- The underlying distribution is unknown or highly complex and you cannot validate the synthetic data's fidelity
- Downstream decisions have high stakes and the synthetic data has not been rigorously evaluated against held-out real data
4.2 Python Libraries for Synthetic Data
Faker
- Source: `pip install faker`; documentation at faker.readthedocs.io
- Description: Generates fake but realistic personal data — names, addresses, phone numbers, emails, dates, job titles, credit card numbers, text paragraphs, and more. Supports localization for 60+ languages. The simplest tool for populating databases, testing pipelines, and creating demonstration datasets. NK uses this in her early data strategy exercises at Athena Retail Group to prototype customer databases without accessing production data.
- Use Case: Creating realistic test data for pipeline development, database schema testing, and UI prototyping
- Example:

```python
from faker import Faker
import pandas as pd

fake = Faker()
Faker.seed(42)  # Reproducibility

customers = pd.DataFrame({
    'customer_id': range(1, 1001),
    'name': [fake.name() for _ in range(1000)],
    'email': [fake.email() for _ in range(1000)],
    'city': [fake.city() for _ in range(1000)],
    'signup_date': [fake.date_between(start_date='-3y') for _ in range(1000)],
    'lifetime_value': [round(fake.random.uniform(10, 5000), 2) for _ in range(1000)]
})
```
- Relevant Chapters: Ch. 3 (Python basics), Ch. 4 (data strategy — prototype datasets), Ch. 12 (MLOps — test data)
SDV (Synthetic Data Vault)
- Source: `pip install sdv`; documentation at docs.sdv.dev
- Description: A suite of machine learning models that learn the statistical properties of your real data and generate new synthetic rows that preserve those properties. Supports single tables (using Gaussian copulas or CTGAN), multi-table relational databases, and time series. The most sophisticated open-source synthetic data library available.
- Use Case: Generating privacy-safe replicas of sensitive datasets (healthcare, finance) while preserving column correlations, distributions, and constraints
- Key Models:
  - `GaussianCopulaSynthesizer` — Fast, parametric; best for well-behaved numerical data
  - `CTGANSynthesizer` — GAN-based; handles mixed data types and complex distributions
  - `TVAESynthesizer` — VAE-based; often faster convergence than CTGAN
- Example:
```python
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

# real_data: a pandas DataFrame holding the sensitive source table
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=5000)
```
- Relevant Chapters: Ch. 7 (classification — augmenting rare classes), Ch. 25 (bias — counterfactual generation), Ch. 29 (privacy — de-identification alternatives)
scikit-learn Synthetic Data Generators
- Source: Built into scikit-learn (`sklearn.datasets`)
- Description: Functions for generating synthetic classification, regression, and clustering datasets with controlled properties. Useful when you need data with specific characteristics (number of informative features, noise level, class balance) for teaching or experimentation.
- Key Functions:
  - `make_classification()` — Binary or multi-class classification with controllable separability
  - `make_regression()` — Linear regression targets with specified noise
  - `make_blobs()` — Gaussian clusters for clustering exercises
  - `make_moons()` / `make_circles()` — Non-linearly separable classes for demonstrating kernel methods and neural networks
- Example:
```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.95, 0.05],  # Imbalanced classes
    random_state=42
)
```
- Relevant Chapters: Ch. 7 (classification — controlled experiments), Ch. 9 (clustering — `make_blobs`), Ch. 11 (model evaluation — learning curves), Ch. 13 (neural networks — `make_moons`)
NumPy / SciPy Statistical Distributions
- Source: Built into NumPy (`numpy.random`) and SciPy (`scipy.stats`)
- Description: Generate random samples from any standard probability distribution — normal, uniform, exponential, Poisson, beta, gamma, binomial, and dozens more. Essential for Monte Carlo simulations, bootstrapping, and creating controlled experimental datasets.
- Use Case: Simulating business scenarios (demand variability, arrival rates, defect probabilities), creating noise for model robustness testing
- Relevant Chapters: Ch. 8 (regression — simulated data), Ch. 16 (time series — synthetic series), Ch. 34 (ROI — Monte Carlo simulation)
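A few lines of NumPy are enough to turn a distributional assumption into a Monte Carlo estimate. The scenario below is hypothetical, chosen only to illustrate the pattern: a store whose daily demand is Poisson-distributed asks how often a fixed stock level covers it.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility

# Hypothetical scenario: daily demand ~ Poisson(mean=40); 50 units stocked each morning.
# Estimate the service level (probability demand is fully met) by simulation.
n_days = 100_000
demand = rng.poisson(lam=40, size=n_days)
stock = 50

service_level = (demand <= stock).mean()
print(f"Estimated P(demand fully met): {service_level:.3f}")
```

Swapping the distribution (e.g., `rng.normal` for forecast error, `rng.exponential` for inter-arrival times) changes the business question without changing the simulation pattern.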
4.3 Commercial and Cloud Synthetic Data Platforms
Gretel.ai
- Source: gretel.ai; Python SDK available via `pip install gretel-client`
- Description: A cloud platform for generating synthetic data using deep learning models. Offers pre-built APIs for tabular data synthesis, text generation, and data transformation. Includes built-in privacy evaluation (SQS — Synthetic Quality Score) that measures fidelity and privacy simultaneously. The free tier supports small datasets and limited API calls.
- Features: Tabular synthesis (LSTM, ACTGAN), text synthesis, data de-identification, privacy metrics
- Access: Free developer tier; paid plans for larger datasets and production use
- Relevant Chapters: Ch. 29 (privacy), Ch. 12 (MLOps — data pipeline testing)
MOSTLY AI
- Source: mostly.ai; cloud platform with free tier
- Description: Enterprise-focused synthetic data platform specializing in privacy-safe data generation for tabular and sequential data. Generates synthetic data that maintains correlations, distributions, and temporal patterns from the original while providing formal privacy guarantees. Includes quality reports comparing synthetic and original data distributions.
- Features: Smart imputation, rare category handling, time series, linked tables
- Access: Free tier (100K rows/day); paid enterprise plans
- Relevant Chapters: Ch. 29 (privacy), Ch. 4 (data strategy)
Tonic.ai
- Source: tonic.ai
- Description: Focused on creating realistic, de-identified replicas of production databases for development and testing. Preserves referential integrity across tables, maintains data format constraints, and supports subset generation. Designed for engineering and data teams that need production-like data in non-production environments.
- Features: Subsetting, referential integrity, format-preserving transformations, consistency
- Access: Free community edition; paid plans for enterprise
- Relevant Chapters: Ch. 12 (MLOps — staging environments), Ch. 29 (privacy)
Hazy
- Source: hazy.com
- Description: Enterprise synthetic data platform with a focus on regulated industries (financial services, healthcare, insurance). Uses differentially private generative models to produce synthetic data with formal privacy guarantees. Includes data quality reports and compliance documentation.
- Features: Differential privacy, compliance reporting, sequential data, multi-table support
- Access: Enterprise pricing; demo available
- Relevant Chapters: Ch. 29 (privacy — differential privacy), Ch. 28 (AI regulation — compliance)
Synthesis AI
- Source: synthesis.ai
- Description: Specializes in synthetic image and video data for computer vision. Generates photorealistic faces, people, and environments with pixel-perfect labels (segmentation masks, depth maps, landmarks). Useful when collecting and labeling real images is expensive or raises privacy concerns (e.g., facial recognition training).
- Features: Photorealistic rendering, automatic labeling, diversity controls, scene customization
- Access: Paid; custom pricing
- Relevant Chapters: Ch. 15 (computer vision — training data), Ch. 25 (bias — diverse training sets)
5. Data for Specific Book Exercises
This section maps key exercises and projects from each part of the textbook to recommended datasets. Use this as a quick reference when you need data for a specific assignment.
Part 1: Foundations of AI for Business (Chapters 1–6)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 3: pandas data manipulation practice | FiveThirtyEight datasets, Faker-generated data | Small, clean CSVs for learning syntax |
| Ch. 4: Data quality assessment exercise | Online Retail Dataset (UCI) | Contains missing values, duplicates, and data type issues — ideal for data quality auditing |
| Ch. 5: `EDAReport` builder | Olist Brazilian E-Commerce Dataset | Multiple related tables for joining; rich mix of numeric and categorical features |
| Ch. 5: Visualization exercises | Kaggle — "World Happiness Report" | Clean, well-structured, and produces compelling visualizations |
| Ch. 6: Business case assessment | Walmart Sales Forecasting | Clear business objective; quantifiable outcomes for ROI estimation |
Part 2: Core Machine Learning for Business (Chapters 7–12)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 7: `ChurnClassifier` | Online Retail Dataset (UCI), Kaggle "Telco Customer Churn" | Telco Churn is a clean binary classification dataset with clear business interpretation |
| Ch. 7: Fraud detection exercise | Credit Card Fraud Detection (Kaggle) | Class imbalance; precision-recall tradeoffs |
| Ch. 8: `DemandForecaster` | Walmart Sales Forecasting, Rossmann Store Sales | Multi-store, multi-feature regression with calendar effects |
| Ch. 8: Price prediction exercise | Kaggle "House Prices" | Classic regression competition; rich feature engineering opportunities |
| Ch. 9: `CustomerSegmenter` | Online Retail Dataset (UCI), Olist | RFM features for clustering; validate against known segments |
| Ch. 10: `RecommendationEngine` | Instacart Market Basket Analysis, H&M Fashion | Market basket and collaborative filtering exercises |
| Ch. 11: `ModelEvaluator` | German Credit Data (UCI) | Multiple model comparison with fairness constraints |
| Ch. 12: MLOps pipeline exercise | Faker-generated data, scikit-learn `make_classification` | Synthetic data for controlled pipeline testing |
Part 3: Deep Learning and Specialized AI (Chapters 13–18)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 13: Neural network from scratch | MNIST, Fashion-MNIST, CIFAR-10 | Start with MNIST; graduate to CIFAR-10 for CNNs |
| Ch. 14: `ReviewAnalyzer` | Amazon Product Reviews, IMDB Reviews, Yelp Open Dataset | Sentiment analysis at different scales |
| Ch. 14: Text classification exercise | AG News, 20 Newsgroups | Multi-class text classification |
| Ch. 15: Image classification project | Chest X-Ray (Pneumonia), Stanford Cars, Oxford Pets | Transfer learning with pre-trained ImageNet models |
| Ch. 15: Object detection exercise | COCO | Standard bounding box and segmentation tasks |
| Ch. 16: Prophet forecasting | Yahoo Finance (`yfinance`), FRED | Financial time series with trend, seasonality, and holidays |
| Ch. 17: LLM integration | OpenAI API + SEC EDGAR filings | Document summarization and information extraction |
| Ch. 18: Multimodal exercise | H&M Fashion (images + metadata), COCO Captions | Combining image and text features |
Part 4: Prompt Engineering and AI Tools (Chapters 19–24)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 19: `PromptBuilder` exercises | Any text dataset; Faker for generating test inputs | Focus is on prompt design, not data complexity |
| Ch. 20: `PromptChain` exercises | SEC EDGAR filings, NewsAPI | Multi-step reasoning over real documents |
| Ch. 21: RAG pipeline | Wikipedia dumps (subset), PubMed abstracts | Knowledge base for retrieval-augmented generation |
| Ch. 22: No-code/low-code AI projects | Kaggle "Titanic" (AutoML comparison) | Compare no-code tool results against manual approaches |
| Ch. 23: API integration lab | OpenAI API, Alpha Vantage, OpenWeatherMap | Practice API authentication, error handling, rate limiting |
| Ch. 24: Marketing AI project | Yelp Reviews, Reddit API, Olist reviews | Customer sentiment analysis and social listening |
Part 5: AI Ethics, Bias, and Governance (Chapters 25–30)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 25: `BiasDetector` | German Credit Data (UCI), COMPAS recidivism data (ProPublica), CelebA | Datasets with known demographic disparities |
| Ch. 26: `ExplainabilityDashboard` | Lending Club loan data, Heart Disease Dataset | Models with high-stakes decisions requiring explanation |
| Ch. 27: Governance framework exercise | No specific dataset needed | Framework and checklist exercise |
| Ch. 28: Regulation analysis | EU AI Act text, NIST AI RMF documentation | Policy documents, not data |
| Ch. 29: Privacy exercise | MIMIC-III, Faker-generated PII | De-identification and privacy-preserving analytics |
| Ch. 30: Responsible AI audit | Any Ch. 7–9 model output + SDV synthetic data | Full audit pipeline: bias check, explainability, privacy |
Part 6: AI Strategy and Organizational Transformation (Chapters 31–36)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 34: `AIROICalculator` | Walmart Sales (before/after modeling), FRED economic data | Quantifying business impact of AI interventions |
| Ch. 36: Industry deep dive | Industry-specific datasets from Section 2 above | Select based on your chosen industry vertical |
Part 7 and Capstone (Chapters 37–40)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 39: `AIMaturityAssessment` + `TransformationRoadmapGenerator` | Faker for simulated organizational data; real company data if available | The capstone AI Transformation Plan benefits from combining real industry data with synthetic organizational metrics |
| Ch. 39: Capstone dataset selection | Use Google Dataset Search + this appendix | Choose datasets aligned with your transformation domain (retail, finance, healthcare, manufacturing) |
Quick Reference: Dataset Selection Checklist
Before committing to a dataset for your project, run through this checklist:
- Relevance — Does the dataset align with your business question? A technically interesting dataset that does not map to a real business problem will produce a weak project.
- Size — Is the dataset large enough for your chosen method? Deep learning generally needs thousands of examples; classical ML can work with hundreds. Is it small enough to work with given your compute resources?
- Quality — How much preprocessing is required? Some datasets are analysis-ready; others require significant cleaning. Factor this time into your project plan.
- Documentation — Is there a data dictionary, codebook, or schema description? Poorly documented data leads to incorrect assumptions.
- Recency — When was the data collected? A dataset from 2010 may not reflect current patterns, especially in fast-moving domains like social media or e-commerce.
- License — Can you legally use the data for your intended purpose? Academic use, commercial use, and redistribution rights vary by dataset. Check before you build.
- Ethical considerations — Does the data contain sensitive attributes (race, gender, health status)? If so, have appropriate protections been applied? Could your analysis cause harm to the populations represented? See Chapters 25–30 for frameworks.
- Reproducibility — Can someone else access the same data to verify your results? APIs with changing data are harder to reproduce from than static CSV downloads. Consider saving a snapshot.
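One low-effort way to act on the snapshot advice is to store the raw download alongside a checksum, so a later run can verify it is analyzing identical data. A sketch using only the standard library; `save_snapshot` and the file names are our own illustrative choices:

```python
import hashlib
import tempfile
from pathlib import Path

def save_snapshot(csv_text: str, path: Path) -> str:
    """Write a point-in-time copy of downloaded data plus its SHA-256 checksum."""
    data = csv_text.encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    path.write_bytes(data)
    Path(f"{path}.sha256").write_text(digest)  # companion checksum file
    return digest

# Example: snapshot a (tiny, fabricated) API response into a temp directory
snapshot_dir = Path(tempfile.mkdtemp())
digest = save_snapshot("id,value\n1,10\n2,20\n", snapshot_dir / "api_snapshot.csv")
```

Committing the checksum (not necessarily the data itself, which may be large or license-restricted) to version control documents exactly which snapshot your results depend on.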
A Final Note on Data Ethics
As Prof. Okonkwo reminds us throughout this textbook, data is never neutral. Every dataset reflects the choices of the people who collected it — what they chose to measure, whom they chose to include, and how they chose to categorize the world. The datasets listed in this appendix are powerful tools for learning, but they are not free from bias, omission, or error.
When working with any dataset, ask yourself:
- Who collected this data, and why?
- Who is represented in this data, and who is missing?
- What assumptions are embedded in the category definitions and labels?
- What could go wrong if a model trained on this data were deployed in the real world?
These questions are not obstacles to your analysis — they are essential components of it. The most technically sophisticated model built on unexamined data is, as Tom learns at Athena Retail Group, a liability disguised as an asset.
"Data does not speak for itself. It speaks through the lens of whoever collected it, cleaned it, and chose to analyze it. Your job is to understand that lens." — Prof. Diane Okonkwo, Chapter 4 closing lecture