Appendix E: Data Sources Guide
"The best algorithm in the world is useless without the right data to feed it." — Prof. Diane Okonkwo, in-class remark during Chapter 4
One of the most common obstacles facing MBA students working on AI and machine learning projects is finding appropriate data. This appendix is your comprehensive directory. Whether you are completing a textbook exercise, building your capstone AI Transformation Plan (Chapter 39), or exploring a new industry application on your own, the resources below will help you locate, access, and work with high-quality datasets.
Every entry follows a consistent format:
- Name — The resource or dataset
- Source — Where to find it (described by platform and path rather than full URL, since web addresses change)
- Description — What it contains and why it matters
- Size — Approximate scale
- Format — File types and structures
- Access — Licensing, registration requirements, and cost
- Relevant Chapters — Where this dataset connects to textbook material
1. General-Purpose Dataset Platforms
These platforms aggregate datasets across industries and use cases. They are your first stop when searching for project data.
1.1 Kaggle Datasets
- Source: kaggle.com → Datasets tab
- Description: The world's largest data science community hosts over 200,000 public datasets spanning every industry and data type. Kaggle also hosts competitions with curated, well-documented datasets that include leaderboards and community notebooks demonstrating analysis approaches. The "Getting Started" competitions (Titanic, House Prices, Digit Recognizer) are ideal for first-time ML practitioners.
- Size: Ranges from a few KB to hundreds of GB per dataset
- Format: CSV, JSON, SQLite, images, text files; downloadable as ZIP archives
- Access: Free with Kaggle account; some competition datasets have usage restrictions
- Relevant Chapters: Virtually all chapters — especially Ch. 5 (EDA), Ch. 7 (classification), Ch. 8 (regression), Ch. 9 (clustering), Ch. 11 (model evaluation)
1.2 UCI Machine Learning Repository
- Source: archive.ics.uci.edu
- Description: One of the oldest and most widely cited sources in machine learning research, maintained by the University of California, Irvine. Contains over 600 datasets used in thousands of academic papers. Each dataset includes metadata about the number of instances, features, data types, and associated tasks (classification, regression, clustering). The repository is an excellent source for benchmark datasets where you want to compare your results against published literature.
- Size: Typically small to medium (hundreds to hundreds of thousands of rows)
- Format: CSV, ARFF, data/names file pairs
- Access: Free; no registration required for most datasets
- Relevant Chapters: Ch. 7 (classification), Ch. 8 (regression), Ch. 9 (unsupervised learning), Ch. 13 (neural networks)
1.3 Google Dataset Search
- Source: datasetsearch.research.google.com
- Description: A search engine specifically for datasets, indexing metadata from thousands of repositories worldwide. Think of it as "Google for data." It surfaces datasets from government portals, academic institutions, news organizations, and commercial providers. Particularly useful when you have a specific topic in mind but do not know which repository might host the data.
- Size: Varies by linked source
- Format: Varies by linked source
- Access: Free search; individual dataset access depends on the hosting institution
- Relevant Chapters: All chapters — useful for capstone research (Ch. 39) and industry application exploration (Ch. 36)
1.4 Data.gov (United States)
- Source: data.gov
- Description: The U.S. government's open data portal, containing over 300,000 datasets from federal agencies including the Census Bureau, Bureau of Labor Statistics, Department of Education, EPA, FDA, and more. An invaluable source for demographic data, economic indicators, health statistics, environmental measurements, and transportation records. Many of these datasets power the analytics behind public policy decisions.
- Size: Ranges from small reference tables to multi-gigabyte longitudinal surveys
- Format: CSV, XML, JSON, Shapefile (geospatial), API endpoints
- Access: Free; no registration required; U.S. Government Open Data License
- Relevant Chapters: Ch. 4 (data strategy), Ch. 8 (regression), Ch. 16 (time series), Ch. 36 (industry applications)
1.5 Data.gov.uk (United Kingdom)
- Source: data.gov.uk
- Description: The UK's open data portal, offering over 50,000 datasets from government departments, NHS, Transport for London, and local councils. Particularly strong in healthcare outcomes, transportation, education, and public finance. The data quality and documentation standards are generally excellent, reflecting the UK's early leadership in the open data movement.
- Size: Small to large
- Format: CSV, JSON, ODS, API endpoints
- Access: Free; Open Government Licence
- Relevant Chapters: Ch. 4 (data strategy), Ch. 28 (AI regulation — UK approach), Ch. 36 (industry applications)
1.6 AWS Open Data Registry
- Source: registry.opendata.aws
- Description: Amazon Web Services hosts a curated registry of large-scale public datasets available directly on AWS infrastructure. Includes satellite imagery (Landsat, Sentinel), genomic data (1000 Genomes Project), weather data (NOAA), and more. The key advantage is that these datasets are already stored in S3 buckets, so if you are working in an AWS environment, you can access them without downloading.
- Size: Often very large (terabytes); subset access is available
- Format: Parquet, CSV, GeoTIFF, FASTQ, and other domain-specific formats
- Access: Free to access; AWS compute costs apply if processing in the cloud
- Relevant Chapters: Ch. 15 (computer vision), Ch. 16 (time series), Ch. 23 (cloud AI services)
1.7 Hugging Face Datasets
- Source: huggingface.co → Datasets tab
- Description: A rapidly growing hub of over 100,000 datasets, with particular strength in NLP and generative AI tasks. Datasets are loadable in a single line of Python code using the `datasets` library. Community-contributed datasets include text classification, question answering, summarization, translation, dialogue, and multimodal tasks. Also hosts evaluation benchmarks like GLUE, SuperGLUE, and MMLU.
- Size: Varies; many are pre-split into train/validation/test sets
- Format: Arrow format (via the `datasets` library); exportable to CSV, JSON, Parquet
- Access: Free; some datasets require agreeing to terms of use
- Relevant Chapters: Ch. 14 (NLP), Ch. 17 (LLMs), Ch. 18 (multimodal), Ch. 19–20 (prompt engineering)
1.8 Eurostat
- Source: ec.europa.eu/eurostat
- Description: The statistical office of the European Union, providing high-quality data on the economy, population, trade, industry, agriculture, and environment for all EU member states. Offers extensive time series data going back decades, with standardized definitions across countries. Excellent for cross-country comparative analyses.
- Size: Medium to large; many datasets have tens of thousands of time-indexed observations
- Format: CSV, TSV, JSON-stat, SDMX; accessible via API
- Access: Free; no registration required
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 28 (EU AI regulation context)
1.9 World Bank Open Data
- Source: data.worldbank.org
- Description: Global development indicators covering 217 countries and spanning more than 60 years. Includes GDP, population, health expenditure, education enrollment, poverty rates, infrastructure metrics, and hundreds of other indicators. The data is clean, well-documented, and updated regularly. Ideal for international business analysis and macro-level forecasting.
- Size: Medium (thousands of country-year observations per indicator)
- Format: CSV, Excel, XML; API available
- Access: Free; Creative Commons Attribution 4.0 license
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 36 (industry applications)
1.10 Papers With Code Datasets
- Source: paperswithcode.com → Datasets section
- Description: Links academic papers to the datasets and code used to produce their results. When you find a state-of-the-art method for a particular task, this platform shows you the exact dataset on which it was benchmarked. Contains over 6,000 datasets across machine learning, NLP, computer vision, and more.
- Size: Varies
- Format: Varies
- Access: Free; individual dataset licenses vary
- Relevant Chapters: Ch. 11 (model evaluation — benchmarking), Ch. 13 (neural networks), Ch. 14 (NLP), Ch. 15 (computer vision)
1.11 FiveThirtyEight Data
- Source: github.com/fivethirtyeight/data
- Description: Datasets behind the data journalism published by FiveThirtyEight. Topics include politics, sports, economics, science, and culture. These datasets are typically clean, small enough to work with on a laptop, and accompanied by a published article that provides context and analytical framing. Excellent for practicing EDA and storytelling with data.
- Size: Small to medium (hundreds to tens of thousands of rows)
- Format: CSV
- Access: Free; available under various open licenses
- Relevant Chapters: Ch. 5 (EDA), Ch. 7 (classification), Ch. 8 (regression)
1.12 GitHub Awesome Public Datasets
- Source: github.com → search "awesome-public-datasets"
- Description: A community-curated list of high-quality public datasets organized by topic, including agriculture, biology, climate, economics, education, energy, government, healthcare, natural language, social networks, sports, and transportation. Not a data host itself, but a comprehensive index pointing to the best datasets across the internet.
- Size: Varies
- Format: Varies
- Access: Free index; individual dataset access varies
- Relevant Chapters: All chapters; especially useful for capstone topic exploration (Ch. 39)
1.13 Microsoft Research Open Data
- Source: msropendata.com
- Description: Curated datasets from Microsoft Research projects spanning NLP, computer vision, social science, and information retrieval. Includes datasets used in influential research papers and often comes with baseline code. Quality and documentation are consistently strong.
- Size: Medium to large
- Format: CSV, TSV, image files, specialized formats
- Access: Free; some datasets require agreeing to a research use license
- Relevant Chapters: Ch. 14 (NLP), Ch. 15 (computer vision), Ch. 37 (emerging technologies)
1.14 OpenML
- Source: openml.org
- Description: An open platform for sharing datasets, machine learning tasks, and experimental results. Contains thousands of datasets with standardized metadata, enabling automated benchmarking. Integrates with scikit-learn via the `openml` Python package, allowing you to load datasets directly into your modeling pipeline.
- Size: Mostly small to medium
- Format: ARFF, CSV; directly loadable via Python API
- Access: Free with account registration
- Relevant Chapters: Ch. 7 (classification), Ch. 8 (regression), Ch. 11 (model evaluation)
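The scikit-learn integration mentioned above also works in the other direction: scikit-learn can pull OpenML datasets by name via `fetch_openml`. A sketch, assuming scikit-learn is installed and a network connection is available:

```python
from sklearn.datasets import fetch_openml

# Fetch the classic iris dataset from OpenML; as_frame=True returns
# pandas objects rather than raw NumPy arrays.
iris = fetch_openml("iris", version=1, as_frame=True)

X = iris.data    # 150 rows, 4 feature columns
y = iris.target  # species labels
print(X.shape, y.nunique())
```

Pinning `version` explicitly is good practice, since OpenML may host multiple versions of the same dataset name.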
1.15 Datahub.io
- Source: datahub.io
- Description: A publishing platform for open data, maintained by Open Knowledge Foundation. Hosts curated "core datasets" covering economic indicators, geodata, reference tables, and other commonly needed data. These core datasets follow the Frictionless Data standard, ensuring consistent formatting and metadata.
- Size: Small to medium
- Format: CSV, JSON; Frictionless Data packages
- Access: Free; open licenses
- Relevant Chapters: Ch. 4 (data strategy), Ch. 5 (EDA)
2. Industry-Specific Datasets
2.1 Retail and E-Commerce
Instacart Market Basket Analysis
- Source: Kaggle → "Instacart Market Basket Analysis" competition
- Description: Over 3 million grocery orders from more than 200,000 Instacart users. Contains order sequences, product details, department and aisle information, and reorder flags. A rich dataset for market basket analysis, recommendation systems, and customer behavior modeling. This is the dataset Ravi Mehta references when discussing Athena Retail Group's early recommendation experiments.
- Size: ~3.4 million orders across 200,000+ users; approximately 550 MB
- Format: CSV (six relational tables)
- Access: Free with Kaggle account; Instacart competition rules apply
- Relevant Chapters: Ch. 9 (clustering), Ch. 10 (recommendation systems), Ch. 24 (marketing AI)
Online Retail Dataset (UCI)
- Source: UCI Machine Learning Repository → "Online Retail" or "Online Retail II"
- Description: Transactional data from a UK-based online retailer specializing in all-occasion gifts, covering December 2009 through December 2011. Contains invoice numbers, stock codes, descriptions, quantities, unit prices, customer IDs, and country codes. The classic dataset for RFM analysis, customer segmentation, and churn prediction.
- Size: ~541,000 transactions; approximately 45 MB
- Format: Excel (XLSX) or CSV
- Access: Free; no registration required
- Relevant Chapters: Ch. 5 (EDA), Ch. 7 (classification — churn), Ch. 9 (customer segmentation)
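The RFM analysis this dataset is known for reduces to a single groupby over customers. A minimal sketch on a toy frame using the dataset's actual column names (InvoiceNo, InvoiceDate, CustomerID, Quantity, UnitPrice); the values here are made up for illustration:

```python
import pandas as pd

# Toy transactions mimicking the Online Retail schema.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo":  ["A1", "A2", "B1", "B2", "B3", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-05", "2011-11-20",
        "2011-12-07", "2011-12-09", "2011-10-01"]),
    "Quantity":  [2, 1, 5, 3, 1, 10],
    "UnitPrice": [3.0, 10.0, 2.0, 4.0, 8.0, 1.5],
})
tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

# Recency is measured against a snapshot date just after the last transaction.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    frequency=("InvoiceNo", "nunique"),
    monetary=("Revenue", "sum"),
)
print(rfm)
```

From here, binning each column into quintiles gives the familiar 1–5 RFM scores used for segmentation.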
Amazon Product Reviews
- Source: Available via multiple mirrors; search "Amazon Product Reviews Dataset" or "Amazon Customer Reviews Dataset" on AWS Open Data
- Description: Tens of millions of product reviews spanning multiple product categories, with star ratings, review text, helpfulness votes, and product metadata. The sheer scale makes it ideal for NLP tasks including sentiment analysis, aspect-based opinion mining, and review summarization.
- Size: Ranges from millions to over 130 million reviews depending on the version
- Format: JSON, TSV, Parquet
- Access: Free; various open licenses depending on the specific version
- Relevant Chapters: Ch. 14 (NLP — `ReviewAnalyzer`), Ch. 17 (LLM summarization), Ch. 24 (customer experience)
Walmart Sales Forecasting
- Source: Kaggle → "Walmart Recruiting — Store Sales Forecasting"
- Description: Historical sales data for 45 Walmart stores, including department-level weekly sales, store size, type, temperature, fuel price, CPI, unemployment rate, and promotional markdown events. Designed for demand forecasting exercises with rich external features.
- Size: ~420,000 records across 45 stores and 99 departments
- Format: CSV
- Access: Free with Kaggle account
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series forecasting — Prophet workflow)
Olist Brazilian E-Commerce Dataset
- Source: Kaggle → "Brazilian E-Commerce Public Dataset by Olist"
- Description: Real commercial data from the Olist marketplace in Brazil, covering 100,000 orders placed between 2016 and 2018. Includes customer, seller, product, order, payment, review, and geolocation data across eight relational tables. An excellent dataset for practicing data integration and building end-to-end analytics pipelines.
- Size: ~100,000 orders; approximately 45 MB total
- Format: CSV (eight relational tables)
- Access: Free with Kaggle account; Creative Commons license
- Relevant Chapters: Ch. 5 (EDA), Ch. 7 (classification), Ch. 10 (recommendations), Ch. 24 (customer experience)
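The data-integration practice this dataset offers is mostly about joining its relational tables on shared keys. A toy sketch with pandas, using two miniature tables whose column names follow the Olist schema (the row values are invented):

```python
import pandas as pd

# Toy versions of two of the eight Olist tables.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c1"],
    "order_status": ["delivered", "delivered", "canceled"],
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "customer_state": ["SP", "RJ"],
})

# A left join keeps every order even if a customer record were missing.
merged = orders.merge(customers, on="customer_id", how="left")
orders_per_state = merged.groupby("customer_state")["order_id"].count()
print(orders_per_state)
```

Chaining a few such merges (orders → items → products → sellers) is exactly the end-to-end pipeline exercise the description refers to.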
Rossmann Store Sales
- Source: Kaggle → "Rossmann Store Sales"
- Description: Sales data for 1,115 Rossmann drugstores across Germany, including store type, assortment level, competition distance, and promotion flags. Ideal for regression and time series forecasting with rich categorical features and calendar effects.
- Size: ~1 million daily observations
- Format: CSV
- Access: Free with Kaggle account
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series forecasting)
H&M Personalized Fashion Recommendations
- Source: Kaggle → "H&M Personalized Fashion Recommendations"
- Description: Purchase history, customer metadata, and article (product) metadata with images for H&M fashion retail. Contains over 1.3 million customers and 100,000+ articles. Ideal for building hybrid recommendation systems that combine transactional data with image features.
- Size: ~31 million transactions; images add several GB
- Format: CSV, JPEG images
- Access: Free with Kaggle account; competition terms
- Relevant Chapters: Ch. 10 (recommendation systems), Ch. 15 (computer vision), Ch. 18 (multimodal)
RetailRocket E-Commerce Dataset
- Source: Kaggle → "RetailRocket Recommender System Dataset"
- Description: Behavioral data (views, add-to-cart, transactions) from a real e-commerce website over 4.5 months. Contains event timestamps, visitor IDs, item IDs, and item properties. Useful for session-based recommendation and conversion funnel analysis.
- Size: ~2.7 million events
- Format: CSV
- Access: Free with Kaggle account
- Relevant Chapters: Ch. 10 (recommendation systems), Ch. 24 (marketing AI)
2.2 Financial Services
Yahoo Finance (via yfinance Python Library)
- Source: Install via `pip install yfinance`; data sourced from Yahoo Finance
- Description: Historical and real-time stock prices, dividends, splits, options chains, and basic financial statements for publicly traded companies worldwide. The `yfinance` Python library provides a clean, programmatic interface. This is the simplest way to get financial time series data for classroom exercises.
- Size: Varies by ticker and date range; typically thousands of daily observations per stock
- Format: Returns pandas DataFrames directly in Python
- Access: Free; rate-limited; Yahoo's terms of service apply
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series forecasting), Ch. 34 (ROI measurement)
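Since the library returns pandas DataFrames, the usual first transformation for time series exercises is computing returns. A sketch on a synthetic close-price series standing in for a download (for real data you would fetch a ticker with `yfinance` instead, which requires the package and a network connection):

```python
import numpy as np
import pandas as pd

# Synthetic daily close prices standing in for a downloaded ticker.
dates = pd.date_range("2024-01-02", periods=5, freq="B")  # business days
close = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0], index=dates)

# Simple returns: percentage change day over day.
simple_returns = close.pct_change().dropna()

# Log returns: additive over time, often preferred for modeling.
log_returns = np.log(close / close.shift(1)).dropna()
print(simple_returns.round(4).tolist())
```

Returns, not raw prices, are what most forecasting and risk models expect as input, since prices are non-stationary.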
Federal Reserve Economic Data (FRED)
- Source: fred.stlouisfed.org; Python access via the `fredapi` library
- Description: Over 800,000 economic time series from 107 sources, maintained by the Federal Reserve Bank of St. Louis. Includes GDP, inflation, unemployment, interest rates, housing starts, consumer confidence, and hundreds of other macroeconomic indicators. The gold standard for U.S. economic data.
- Size: Hundreds to thousands of observations per series (monthly, quarterly, or annual)
- Format: CSV, Excel, JSON; API available (requires free API key)
- Access: Free; requires API key registration for programmatic access
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 34 (ROI — economic context)
SEC EDGAR Filings
- Source: sec.gov/edgar; full-text search at efts.sec.gov/LATEST/search-index
- Description: All public company filings with the U.S. Securities and Exchange Commission, including 10-K annual reports, 10-Q quarterly reports, 8-K current reports, proxy statements, and more. An invaluable source for NLP projects — extracting risk factors, analyzing management discussion sections, or building financial sentiment models.
- Size: Millions of filings spanning decades; individual filings range from a few KB to several MB
- Format: HTML, XBRL, plain text; API available
- Access: Free; no registration required; rate limits apply
- Relevant Chapters: Ch. 14 (NLP — text analysis), Ch. 17 (LLM — document summarization), Ch. 36 (finance applications)
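The risk-factor extraction mentioned above usually starts by slicing the "Item 1A" section out of the filing text. A toy sketch using only the standard library, assuming the filing has already been downloaded and converted to plain text; real 10-Ks are HTML and far less regular, so treat this as the idea rather than a production parser:

```python
import re

# Toy 10-K text; real filings are much longer and messier.
filing = """
Item 1. Business. We make widgets.
Item 1A. Risk Factors. Demand may fall. Competition may intensify.
Item 1B. Unresolved Staff Comments. None.
"""

# Slice the text between the "Item 1A" heading and the next "Item 1B" heading.
match = re.search(r"Item 1A\.(.*?)Item 1B\.", filing, re.DOTALL | re.IGNORECASE)
risk_factors = match.group(1).strip()
print(risk_factors)
```

Once isolated, the section text can be fed to sentiment models, topic models, or an LLM summarizer.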
Lending Club Loan Data
- Source: Kaggle → search "Lending Club"; historical data archives
- Description: Detailed loan origination and performance data from the peer-to-peer lending platform. Includes loan amount, interest rate, employment information, FICO score ranges, delinquency history, and loan status (current, default, charged off). A canonical dataset for credit risk modeling and binary classification.
- Size: Over 2 million loans; approximately 1.5 GB
- Format: CSV
- Access: Free via Kaggle; historical data archives available
- Relevant Chapters: Ch. 7 (classification — default prediction), Ch. 25 (bias detection — lending disparities), Ch. 26 (explainability)
Credit Card Fraud Detection Dataset
- Source: Kaggle → "Credit Card Fraud Detection"
- Description: Anonymized credit card transactions from September 2013, containing 492 frauds out of 284,807 transactions. Features V1–V28 are PCA-transformed for confidentiality; only time and amount are unmasked. The canonical dataset for learning about class imbalance, precision-recall tradeoffs, and anomaly detection.
- Size: 284,807 transactions; approximately 150 MB
- Format: CSV
- Access: Free with Kaggle account; Open Database License
- Relevant Chapters: Ch. 7 (classification — imbalanced classes), Ch. 11 (model evaluation — precision vs. recall), Ch. 29 (security)
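With 492 frauds in 284,807 transactions, a model that predicts "not fraud" for everything is 99.8% accurate and completely useless, which is why this dataset forces you to think in precision and recall per decision threshold. A self-contained sketch of that tradeoff on toy scores (all values invented):

```python
# Toy labels (1 = fraud) and model scores for an imbalanced problem.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.15, 0.05, 0.3, 0.4, 0.1, 0.2, 0.9, 0.35]

def precision_recall(y_true, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# High threshold: precise but misses one fraud.
print(precision_recall(y_true, scores, 0.5))   # (1.0, 0.5)
# Lower threshold: catches both frauds at the cost of a false alarm.
print(precision_recall(y_true, scores, 0.33))
```

The business question — how many false alarms is one caught fraud worth? — is what picks the threshold, not the algorithm.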
European Central Bank Exchange Rates
- Source: ecb.europa.eu → Statistical Data Warehouse
- Description: Reference exchange rates for major world currencies against the euro, updated every business day. Clean, reliable time series data ideal for introductory forecasting exercises.
- Size: Thousands of daily observations per currency pair
- Format: CSV, XML; API (SDMX)
- Access: Free; no registration
- Relevant Chapters: Ch. 16 (time series), Ch. 36 (finance applications)
Quandl / Nasdaq Data Link
- Source: data.nasdaq.com (formerly Quandl)
- Description: A wide-ranging financial and economic data platform aggregating data from hundreds of sources including central banks, exchanges, and alternative data providers. The free tier covers core financial datasets; premium tiers offer alternative data (satellite imagery, shipping, employment).
- Size: Varies by dataset
- Format: CSV, JSON, XML; Python library available
- Access: Free tier with API key; premium datasets require subscription
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 23 (APIs)
German Credit Data (UCI)
- Source: UCI Machine Learning Repository → "Statlog (German Credit Data)"
- Description: 1,000 loan applicants classified as good or bad credit risk, with 20 attributes including credit history, employment, housing, and existing accounts. A classic dataset for fairness-aware machine learning, as it includes protected attributes (age, gender, foreign worker status) that enable bias auditing.
- Size: 1,000 instances; 20 attributes
- Format: CSV / data file
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification), Ch. 25 (bias detection — `BiasDetector`), Ch. 26 (fairness and explainability)
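The simplest bias audit this dataset supports is comparing favorable-outcome rates across a protected attribute (the demographic parity difference). A toy sketch with hypothetical group labels and decisions; a real audit on German Credit would group by age, gender, or foreign-worker status:

```python
import pandas as pd

# Toy approval decisions with a protected attribute.
df = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   1,   0,   1,   0,   0,   0],
})

# Approval rate per group; the gap is the demographic parity difference.
rates = df.groupby("group")["approved"].mean()
parity_gap = rates["A"] - rates["B"]
print(rates.to_dict(), f"gap = {parity_gap:.2f}")
```

A large gap is a flag to investigate, not proof of unfairness on its own — base rates and other fairness criteria (covered in Ch. 26) matter too.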
2.3 Healthcare
MIMIC-III (Medical Information Mart for Intensive Care)
- Source: physionet.org → MIMIC-III Clinical Database
- Description: A large, freely available database of de-identified health records from over 40,000 patients who stayed in critical care units at Beth Israel Deaconess Medical Center between 2001 and 2012. Contains vital signs, laboratory results, medications, caregiver notes, procedures, diagnostic codes, and mortality outcomes. One of the most important open datasets in healthcare AI.
- Size: ~40,000 patients; 26 relational tables; approximately 6 GB compressed
- Format: CSV (relational tables); PostgreSQL database available
- Access: Free after completing a data ethics course (CITI training) and signing a data use agreement via PhysioNet
- Relevant Chapters: Ch. 7 (classification — readmission prediction), Ch. 14 (NLP — clinical notes), Ch. 29 (privacy — de-identification), Ch. 36 (healthcare AI)
WHO Global Health Observatory (GHO)
- Source: who.int → GHO Data Repository; programmatic access via the Athena API
- Description: Health statistics for 194 WHO member states, covering mortality, disease burden, health systems, environmental health, nutrition, and the Sustainable Development Goals. Time series data spanning decades for many indicators.
- Size: Thousands of indicator series across 194 countries
- Format: CSV, JSON; API available
- Access: Free; no registration required
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series), Ch. 36 (healthcare applications)
CMS Medicare Provider Utilization and Payment Data
- Source: data.cms.gov
- Description: Detailed information on services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals, including utilization, charges, and payments. Enables analysis of healthcare spending patterns, provider variation, and potential fraud detection.
- Size: Millions of records across multiple years and data types
- Format: CSV
- Access: Free; no registration required
- Relevant Chapters: Ch. 7 (classification — fraud detection), Ch. 9 (clustering — provider segmentation), Ch. 36 (healthcare applications)
NIH National Library of Medicine (PubMed / PMC)
- Source: pubmed.ncbi.nlm.nih.gov; bulk download via FTP
- Description: Over 36 million citations and abstracts for biomedical literature from MEDLINE, life science journals, and online books. PubMed Central (PMC) provides full-text articles. An enormous corpus for biomedical NLP, citation analysis, and knowledge graph construction.
- Size: 36+ million abstracts; millions of full-text articles
- Format: XML, plain text; E-utilities API available
- Access: Free; API key recommended for bulk access
- Relevant Chapters: Ch. 14 (NLP — biomedical text mining), Ch. 17 (LLM applications)
Heart Disease Dataset (UCI / Cleveland)
- Source: UCI Machine Learning Repository → "Heart Disease"; also on Kaggle
- Description: Clinical attributes (age, sex, chest pain type, resting blood pressure, cholesterol, fasting blood sugar, ECG results, maximum heart rate, exercise-induced angina) for 303 patients, with a target variable indicating the presence of heart disease. Compact, clean, and widely used for introductory classification exercises.
- Size: 303 instances; 14 attributes
- Format: CSV
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification), Ch. 13 (neural networks), Ch. 26 (explainability)
COVID-19 Open Datasets
- Source: Multiple sources — Johns Hopkins CSSE (github.com/CSSEGISandData/COVID-19), Our World in Data (ourworldindata.org), WHO COVID-19 Dashboard data
- Description: Daily case counts, deaths, vaccinations, testing rates, and policy responses for countries and sub-national regions worldwide. One of the most intensively analyzed datasets in history, with extensive community analysis and modeling available for reference.
- Size: Hundreds of thousands of daily observations across 200+ countries
- Format: CSV, JSON
- Access: Free; various open licenses
- Relevant Chapters: Ch. 16 (time series), Ch. 30 (responsible AI — pandemic response), Ch. 36 (healthcare)
Chest X-Ray Images (Pneumonia Detection)
- Source: Kaggle → "Chest X-Ray Images (Pneumonia)"
- Description: 5,863 chest X-ray images labeled as normal or pneumonia (bacterial or viral). Organized into train, validation, and test directories. A popular introductory dataset for medical image classification with convolutional neural networks.
- Size: 5,863 images; approximately 1.2 GB
- Format: JPEG images in directory structure
- Access: Free with Kaggle account
- Relevant Chapters: Ch. 15 (computer vision), Ch. 36 (healthcare AI)
Drug Review Dataset
- Source: UCI Machine Learning Repository → "Drug Review Dataset (Drugs.com)"
- Description: Over 200,000 patient drug reviews scraped from Drugs.com, including the drug name, condition being treated, the review text, a 10-point rating, and the date. Useful for sentiment analysis, topic modeling, and understanding patient experience.
- Size: ~215,000 reviews
- Format: TSV
- Access: Free; no registration
- Relevant Chapters: Ch. 14 (NLP — sentiment analysis), Ch. 24 (customer experience)
2.4 Manufacturing and Industrial
NASA Turbofan Engine Degradation Simulation (C-MAPSS)
- Source: NASA Prognostics Center of Excellence → data repository; also available on Kaggle
- Description: Simulated run-to-failure data for turbofan jet engines under different operating conditions and fault modes. Contains multivariate time series from 21 sensors, with the goal of predicting Remaining Useful Life (RUL). The canonical dataset for predictive maintenance research.
- Size: Four sub-datasets (FD001–FD004) with 100 to 249 engines each
- Format: Text files (space-delimited)
- Access: Free; public domain
- Relevant Chapters: Ch. 8 (regression — RUL prediction), Ch. 16 (time series), Ch. 36 (manufacturing applications)
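C-MAPSS ships sensor readings without an explicit RUL column; since each training engine runs to failure, the standard label is the engine's last observed cycle minus the current cycle. A sketch of that labeling step with pandas on toy data (two engines, a few cycles each):

```python
import pandas as pd

# Toy run-to-failure log: two engines observed until failure.
df = pd.DataFrame({
    "unit":  [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
})

# RUL at each row = last observed cycle for that engine minus current cycle.
df["RUL"] = df.groupby("unit")["cycle"].transform("max") - df["cycle"]
print(df)
```

With the label constructed, RUL prediction becomes an ordinary regression problem on the 21 sensor columns.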
Steel Plates Faults Dataset
- Source: UCI Machine Learning Repository → "Steel Plates Faults"
- Description: 1,941 instances of steel plate faults classified into seven categories (pastry, Z-scratch, K-scratch, stains, dirtiness, bumps, other faults). Each instance has 27 numeric features describing the fault geometry and steel properties. A clean multi-class classification problem relevant to quality control in manufacturing.
- Size: 1,941 instances; 27 features; 7 classes
- Format: CSV
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification — multi-class), Ch. 36 (manufacturing)
SECOM Semiconductor Manufacturing Dataset
- Source: UCI Machine Learning Repository → "SECOM"
- Description: 1,567 observations from a semiconductor manufacturing process, with 590 sensor measurements and a binary pass/fail target. Characterized by high dimensionality, missing values, and severe class imbalance — all of which mirror real-world manufacturing data challenges.
- Size: 1,567 instances; 590 features
- Format: CSV / space-delimited text
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification — imbalanced classes), Ch. 5 (EDA — handling missing data), Ch. 36 (manufacturing)
Predictive Maintenance Dataset (Microsoft)
- Source: Azure AI Gallery / Kaggle → "Predictive Maintenance" or "AI4I 2020 Predictive Maintenance"
- Description: Synthetic but realistic dataset simulating machine failures in an industrial setting. The AI4I version contains 10,000 data points with features like air temperature, process temperature, rotational speed, torque, and tool wear, plus failure mode labels (heat dissipation, power, overstrain, tool wear, random). Designed for teaching predictive maintenance without real industrial data access constraints.
- Size: 10,000 instances; 14 features
- Format: CSV
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification), Ch. 36 (manufacturing applications)
Tennessee Eastman Process Simulation
- Source: Available via academic repositories and GitHub; search "Tennessee Eastman Process dataset"
- Description: Simulated data from a chemical process model developed by Eastman Chemical Company. Contains 52 process variables under normal operation and 21 different fault conditions. Widely used in process monitoring, fault detection, and industrial AI research.
- Size: Thousands of time-indexed observations per simulation run
- Format: DAT, CSV (depending on version)
- Access: Free; academic use
- Relevant Chapters: Ch. 9 (anomaly detection), Ch. 16 (time series), Ch. 36 (manufacturing)
Bosch Production Line Performance
- Source: Kaggle → "Bosch Production Line Performance"
- Description: Anonymous measurements from Bosch's manufacturing process, with the task of predicting internal failures. Features over 4,000 numeric, categorical, and timestamp columns across three files. Extremely high-dimensional and sparse, representing a realistic large-scale manufacturing data challenge.
- Size: ~1.2 million parts; 4,000+ features; approximately 13 GB
- Format: CSV
- Access: Free with Kaggle account; competition terms
- Relevant Chapters: Ch. 7 (classification), Ch. 5 (EDA — high-dimensional data), Ch. 36 (manufacturing)
2.5 NLP and Text Data
Common Crawl
- Source: commoncrawl.org
- Description: A massive, open corpus of web crawl data collected since 2008. Each monthly crawl contains billions of web pages, totaling petabytes of data. The raw material from which many large language models are trained. For classroom use, the pre-processed WET (text-only) files are most practical; use the index to download subsets rather than the full corpus.
- Size: Petabytes in total; individual monthly crawls are tens of terabytes
- Format: WARC, WAT, WET (compressed text)
- Access: Free; stored on AWS Open Data
- Relevant Chapters: Ch. 14 (NLP), Ch. 17 (LLMs — training data), Ch. 25 (bias — web-sourced biases)
Wikipedia Dumps
- Source: dumps.wikimedia.org
- Description: Complete database dumps of all Wikimedia projects, including the full text of Wikipedia in all languages. The English Wikipedia dump contains millions of articles and is one of the most commonly used corpora for NLP research, knowledge base construction, and entity linking.
- Size: English Wikipedia: ~20 GB compressed (text only); full with metadata: 80+ GB
- Format: XML, SQL dumps; community tools exist for conversion to plain text or JSON
- Access: Free; Creative Commons Attribution-ShareAlike license
- Relevant Chapters: Ch. 14 (NLP — text processing), Ch. 17 (LLMs), Ch. 21 (RAG pipeline — knowledge base)
Yelp Open Dataset
- Source: yelp.com/dataset
- Description: A subset of Yelp's business, review, and user data, containing over 6.9 million reviews, 150,000 businesses, and 200,000 photos across 11 metropolitan areas. Includes review text, star ratings, business attributes, check-in data, and user social network information. Rich enough for sentiment analysis, recommendation systems, graph analysis, and photo classification.
- Size: ~6.9 million reviews; approximately 10 GB total
- Format: JSON
- Access: Free after agreeing to Yelp Dataset License; restricted to academic and educational use
- Relevant Chapters: Ch. 14 (NLP — `ReviewAnalyzer`), Ch. 10 (recommendations), Ch. 9 (clustering)
IMDB Movie Reviews
- Source: Available via Hugging Face Datasets, Kaggle, or Stanford AI Lab
- Description: 50,000 movie reviews split evenly between positive and negative sentiment. One of the most popular benchmark datasets for binary sentiment classification. Clean, well-balanced, and extensively benchmarked across hundreds of published models.
- Size: 50,000 reviews; approximately 80 MB
- Format: Text files or CSV; also available via the Hugging Face `datasets` library
- Access: Free; no registration required
- Relevant Chapters: Ch. 14 (NLP — sentiment classification), Ch. 13 (neural networks — text classification)
SQuAD (Stanford Question Answering Dataset)
- Source: rajpurkar.github.io/SQuAD-explorer; also on Hugging Face
- Description: A reading comprehension dataset consisting of questions posed on Wikipedia articles, where the answer to each question is a segment of the corresponding passage. SQuAD 2.0 adds unanswerable questions, requiring models to know when they do not know. A key benchmark for evaluating LLM comprehension capabilities.
- Size: 100,000+ question-answer pairs (SQuAD 1.1); 150,000+ including unanswerable (SQuAD 2.0)
- Format: JSON
- Access: Free; CC BY-SA 4.0
- Relevant Chapters: Ch. 14 (NLP), Ch. 17 (LLMs — evaluation), Ch. 21 (RAG pipeline)
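SQuAD's JSON nests questions inside paragraphs inside articles, so most workflows start by flattening it to one row per question. A sketch using a schematic two-question sample; the structure follows the published format, but the content here is invented purely for illustration:

```python
import json

# Schematic SQuAD 2.0-style record (structure only; not real dataset content)
sample = json.loads("""
{"data": [{"title": "Example", "paragraphs": [{
  "context": "Paris is the capital of France.",
  "qas": [
    {"id": "q1", "question": "What is the capital of France?", "is_impossible": false,
     "answers": [{"text": "Paris", "answer_start": 0}]},
    {"id": "q2", "question": "What is the capital of Spain?", "is_impossible": true,
     "answers": []}
  ]}]}]}
""")

def flatten_squad(squad: dict) -> list[dict]:
    """Flatten nested SQuAD JSON into one row per question."""
    rows = []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                rows.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": para["context"],
                    "answer": qa["answers"][0]["text"] if qa["answers"] else None,
                    "is_impossible": qa.get("is_impossible", False),
                })
    return rows

rows = flatten_squad(sample)
print(len(rows))  # 2
```

The `is_impossible` flag is what distinguishes SQuAD 2.0: a model is rewarded for abstaining on questions the passage cannot answer.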
GLUE and SuperGLUE Benchmarks
- Source: gluebenchmark.com / super.gluebenchmark.com; available via Hugging Face
- Description: Collections of diverse NLP tasks designed to evaluate and compare language understanding models. GLUE includes sentiment analysis (SST-2), textual entailment (MNLI, RTE), paraphrase detection (QQP, MRPC), and more. SuperGLUE adds harder tasks like reading comprehension with commonsense reasoning (ReCoRD), word sense disambiguation (WiC), and causal reasoning (COPA).
- Size: Varies by task; typically thousands to hundreds of thousands of examples
- Format: TSV, JSON; loadable via the Hugging Face `datasets` library
- Access: Free
- Relevant Chapters: Ch. 14 (NLP — benchmarking), Ch. 17 (LLMs — evaluation), Ch. 11 (model evaluation)
AG News Classification Dataset
- Source: Kaggle; also on Hugging Face
- Description: 127,600 news articles categorized into four classes: World, Sports, Business, and Science/Technology. Constructed from the AG corpus of news articles collected by ComeToMyHead academic search engine. A clean, balanced multi-class text classification dataset ideal for introductory NLP exercises.
- Size: 127,600 articles (120,000 train / 7,600 test)
- Format: CSV
- Access: Free
- Relevant Chapters: Ch. 14 (NLP — text classification), Ch. 7 (classification)
20 Newsgroups
- Source: scikit-learn built-in (`sklearn.datasets.fetch_20newsgroups`); also on various mirrors
- Description: A collection of approximately 20,000 newsgroup documents partitioned across 20 different newsgroup categories. One of the original benchmark datasets for text classification and topic modeling. Loadable directly from scikit-learn with a single function call.
- Size: ~20,000 documents; 20 categories
- Format: Text; directly loadable in scikit-learn
- Access: Free; public domain
- Relevant Chapters: Ch. 9 (clustering — topic modeling), Ch. 14 (NLP — text classification)
2.6 Computer Vision
ImageNet (ILSVRC)
- Source: image-net.org
- Description: Over 14 million hand-annotated images organized according to the WordNet hierarchy. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) subset, with 1,000 categories and 1.2 million training images, is the benchmark that catalyzed the deep learning revolution in 2012 when AlexNet won the competition. Pre-trained ImageNet weights are available in every major deep learning framework.
- Size: Full: 14+ million images; ILSVRC: ~1.2 million images; approximately 150 GB
- Format: JPEG images with XML annotations
- Access: Free for research and educational use; requires registration
- Relevant Chapters: Ch. 13 (neural networks — transfer learning), Ch. 15 (computer vision)
COCO (Common Objects in Context)
- Source: cocodataset.org
- Description: Over 330,000 images with 80 object categories, 91 stuff categories, and 250,000 people with keypoints. Annotations include bounding boxes, segmentation masks, and captions. The standard benchmark for object detection, instance segmentation, and image captioning.
- Size: ~330,000 images; approximately 25 GB
- Format: JPEG images; JSON annotations
- Access: Free; Creative Commons Attribution 4.0
- Relevant Chapters: Ch. 15 (computer vision — object detection), Ch. 18 (multimodal — image captioning)
Open Images Dataset (Google)
- Source: storage.googleapis.com/openimages
- Description: A dataset of approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. Covers 600 object classes with over 16 million bounding boxes. More diverse and larger than COCO, though annotations can be noisier.
- Size: ~9 million images; hundreds of GB
- Format: JPEG images; CSV annotations
- Access: Free; Creative Commons Attribution 4.0
- Relevant Chapters: Ch. 15 (computer vision), Ch. 18 (multimodal)
CIFAR-10 and CIFAR-100
- Source: cs.toronto.edu → CIFAR datasets; built into most deep learning frameworks
- Description: CIFAR-10 contains 60,000 32×32 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). CIFAR-100 has the same image count but spans 100 fine-grained classes grouped into 20 superclasses. Small enough to train on a laptop CPU/GPU, making them ideal for rapid prototyping and experimentation.
- Size: 60,000 images each; approximately 170 MB per dataset
- Format: Binary pickle files; also available as PNG images; built into PyTorch and TensorFlow
- Access: Free; no registration
- Relevant Chapters: Ch. 13 (neural networks — CNNs), Ch. 15 (computer vision)
MNIST and Fashion-MNIST
- Source: yann.lecun.com/exdb/mnist (MNIST); github.com/zalandoresearch/fashion-mnist (Fashion-MNIST)
- Description: MNIST is the "hello world" of machine learning: 70,000 grayscale 28×28 images of handwritten digits (0–9). Fashion-MNIST is a drop-in replacement with 10 categories of clothing items (t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, ankle boot), designed to be slightly more challenging. Both are built into virtually every ML framework.
- Size: 70,000 images each; approximately 11 MB (MNIST), 30 MB (Fashion-MNIST)
- Format: IDX format; directly loadable in scikit-learn, PyTorch, TensorFlow
- Access: Free; no registration
- Relevant Chapters: Ch. 7 (classification intro), Ch. 13 (neural networks)
CelebA (CelebFaces Attributes Dataset)
- Source: mmlab.ie.cuhk.edu.hk → CelebA; also on Kaggle
- Description: Over 200,000 celebrity face images annotated with 40 binary attributes (smiling, wearing glasses, male, etc.) and 5 landmark locations. Widely used for facial attribute prediction, face generation, and bias analysis in facial recognition systems.
- Size: 202,599 images; approximately 1.4 GB
- Format: JPEG images; CSV annotations
- Access: Free for non-commercial research
- Relevant Chapters: Ch. 15 (computer vision), Ch. 25 (bias — facial recognition disparities)
Stanford Cars / Oxford Pets / Flowers
- Source: Various Stanford and Oxford research group pages; also on Kaggle and TensorFlow Datasets
- Description: Fine-grained visual classification datasets covering 196 car models, 37 pet breeds, and 102 flower species respectively. Designed for transfer learning experiments where distinguishing between visually similar subcategories is the challenge.
- Size: 8,000–16,000 images each
- Format: JPEG images with labels
- Access: Free for research
- Relevant Chapters: Ch. 15 (computer vision — transfer learning), Ch. 13 (neural networks)
3. APIs for Real-Time Data
Real-time data enables you to build applications that respond to current conditions rather than analyzing historical snapshots. This section covers APIs that provide programmatic access to live or frequently updated data.
Note on API keys: Most APIs below require a free registration and API key. Store your keys in environment variables or a `.env` file — never hardcode them in scripts or commit them to version control. See Chapter 23 for API integration patterns.
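A minimal pattern for that advice, using only the standard library; the variable name `OPENWEATHER_API_KEY` is just an example:

```python
import os

def get_api_key(var_name: str = "OPENWEATHER_API_KEY") -> str:
    """Fetch an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set {var_name} in your shell or .env file before running.")
    return key
```

If you keep keys in a `.env` file, the `python-dotenv` package's `load_dotenv()` will populate `os.environ` from it before this lookup runs.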
3.1 Financial Data APIs
Alpha Vantage
- Source: alphavantage.co; Python wrapper: `alpha_vantage` package
- Description: Free API providing real-time and historical stock prices, forex rates, cryptocurrency prices, and over 50 technical indicators. The free tier allows 25 requests per day. A cleaner alternative to scraping financial websites.
- Access: Free tier with API key; premium plans available
- Relevant Chapters: Ch. 16 (time series), Ch. 23 (cloud AI services and APIs)
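As a sketch of what a call looks like, the helper below (our own name) assembles a daily-prices request URL against the documented query endpoint; the commented-out fetch assumes the `requests` package and a valid key:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"

def daily_prices_url(symbol: str, api_key: str) -> str:
    """Assemble a TIME_SERIES_DAILY request URL for the given ticker."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return f"{BASE_URL}?{urlencode(params)}"

url = daily_prices_url("IBM", "demo")

# To actually fetch (network call; assumes the `requests` package):
# import requests
# data = requests.get(url, timeout=10).json()
```

Keeping the URL construction separate from the network call makes the request logic easy to unit-test without hitting the rate-limited API.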
Polygon.io
- Source: polygon.io; Python client available
- Description: Financial market data API covering stocks, options, forex, and crypto. Provides real-time and historical data including trades, quotes, bars, and reference data. More robust rate limits than free alternatives.
- Access: Free basic tier; paid tiers for real-time data
- Relevant Chapters: Ch. 16 (time series), Ch. 23 (APIs)
3.2 Social Media and Web APIs
Reddit API (via PRAW)
- Source: reddit.com/dev/api; Python wrapper: `praw` package
- Description: Access Reddit posts, comments, user data, and subreddit metadata. Useful for sentiment analysis, trend detection, and social network analysis. The `praw` (Python Reddit API Wrapper) library simplifies authentication and pagination.
- Access: Free with Reddit app registration; rate-limited
- Relevant Chapters: Ch. 14 (NLP — sentiment analysis), Ch. 24 (marketing — social listening)
X (Twitter) API
- Source: developer.x.com
- Description: Access tweets, user profiles, followers, and trending topics. The API has undergone significant changes in access levels and pricing since 2023. The free tier provides limited write access and basic search; the basic paid tier provides additional read access. Important for understanding real-time public discourse, but plan for access constraints.
- Access: Free tier with limited access; paid tiers for broader access
- Relevant Chapters: Ch. 14 (NLP), Ch. 24 (marketing AI), Ch. 16 (time series — trend detection)
NewsAPI
- Source: newsapi.org; Python client available
- Description: Aggregates headlines and articles from over 150,000 news sources and blogs worldwide. Searchable by keyword, source, language, and date range. The free tier is limited to development use (100 requests/day, articles up to 1 month old); production use requires a paid plan.
- Access: Free developer tier; paid for production
- Relevant Chapters: Ch. 14 (NLP — news classification), Ch. 17 (LLMs — summarization)
3.3 Weather and Environmental APIs
OpenWeatherMap API
- Source: openweathermap.org/api; Python wrapper available
- Description: Current weather, 5-day forecasts, historical data, and weather alerts for any location worldwide. Supports queries by city name, coordinates, or ZIP code. The free tier provides current weather and 5-day forecasts.
- Access: Free tier (60 calls/minute); paid tiers for historical data and advanced features
- Relevant Chapters: Ch. 8 (regression — weather as feature), Ch. 16 (time series), Ch. 23 (APIs)
NOAA Climate Data Online
- Source: ncdc.noaa.gov; Climate Data Online API
- Description: Historical weather and climate data from NOAA's extensive observation network, including daily summaries, monthly normals, and extreme weather events. Covers stations worldwide with records going back over a century.
- Access: Free with API token
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series)
3.4 Geospatial APIs
OpenStreetMap (Overpass API)
- Source: openstreetmap.org; Overpass API at overpass-api.de
- Description: Volunteer-contributed global mapping data including roads, buildings, businesses, parks, transit stops, and more. The Overpass API allows complex spatial queries. The Python libraries `osmnx` and `geopy` provide convenient access.
- Access: Free; Open Database License
- Relevant Chapters: Ch. 36 (industry applications — logistics, real estate)
Google Maps Platform APIs
- Source: developers.google.com/maps
- Description: Geocoding, directions, places, distance matrix, and static/dynamic maps. Essential for location-based analytics — calculating drive times, finding nearby competitors, and geocoding addresses.
- Access: Free tier ($200/month credit); pay-as-you-go beyond that
- Relevant Chapters: Ch. 23 (cloud AI services), Ch. 36 (retail and logistics)
3.5 Government and Open Data APIs
Census Bureau API
- Source: api.census.gov
- Description: Programmatic access to U.S. Census data, including the American Community Survey, Decennial Census, Economic Census, and Population Estimates. Supports detailed geographic queries down to the census tract level. The Python wrapper `cenpy` simplifies access.
- Access: Free with API key
- Relevant Chapters: Ch. 8 (regression — demographic features), Ch. 9 (clustering — geographic segmentation), Ch. 36 (public sector)
Bureau of Labor Statistics (BLS) API
- Source: bls.gov/developers
- Description: Employment, unemployment, consumer prices, producer prices, wages, and productivity data for the United States. Time series data with monthly, quarterly, and annual frequency.
- Access: Free; API key recommended for higher rate limits
- Relevant Chapters: Ch. 8 (regression), Ch. 16 (time series)
3.6 AI and Machine Learning APIs
OpenAI API
- Source: platform.openai.com
- Description: Access to GPT-4, GPT-3.5, DALL-E, Whisper, and embedding models. The primary API used in Chapters 17, 19, and 20 for LLM integration, prompt engineering, and AI-powered workflows. Supports chat completions, function calling, vision, and fine-tuning.
- Access: Pay-per-use; free trial credits available for new accounts
- Relevant Chapters: Ch. 17 (LLMs), Ch. 19–20 (prompt engineering), Ch. 21 (RAG pipeline), Ch. 23 (cloud AI services)
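As a sketch of the request shape, the helper below (our own name, with an example model string) builds a chat-completions body; the commented-out lines show how the official `openai` client would send it, assuming an `OPENAI_API_KEY` in the environment:

```python
def summarization_payload(document: str, model: str = "gpt-4o-mini") -> dict:
    """Build a chat-completions request body (model name is an example choice)."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Summarize the document in three bullet points for an executive reader."},
            {"role": "user", "content": document},
        ],
    }

payload = summarization_payload("Q3 revenue rose 12% on strong holiday demand.")

# Sending it requires the openai package and an OPENAI_API_KEY environment variable:
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(**payload)
# print(reply.choices[0].message.content)
```

Building the payload in a plain function keeps prompt templates versionable and testable independently of the (billed) API call.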
Hugging Face Inference API
- Source: huggingface.co → Inference API
- Description: Run inference on thousands of open-source models hosted on Hugging Face, including text generation, classification, summarization, translation, image classification, and more. Free tier available for experimentation; rate-limited.
- Access: Free tier; paid Inference Endpoints for production
- Relevant Chapters: Ch. 14 (NLP), Ch. 17 (LLMs — open-source alternatives), Ch. 23 (APIs)
4. Synthetic Data Tools
When real-world data is unavailable, too sensitive, too expensive, or insufficiently diverse, synthetic data can fill the gap. This section covers tools for generating realistic fake data and when to use them.
4.1 When to Use Synthetic Data
Synthetic data is appropriate in the following scenarios:
- Privacy constraints — You need data that mimics real patient, customer, or financial records without exposing actual individuals (Ch. 29)
- Class imbalance — You need more examples of a rare class (fraud, equipment failure) to train a balanced model (Ch. 7, Ch. 11)
- Prototyping — You want to build and test a pipeline before real data is available (Ch. 12, Ch. 33)
- Augmentation — You want to expand a small training set while preserving statistical properties (Ch. 13, Ch. 15)
- Testing and QA — You need diverse test cases to validate data pipelines, dashboards, or applications (Ch. 12)
- Bias mitigation — You want to generate counterfactual examples to test fairness (Ch. 25, Ch. 26)
Synthetic data is not appropriate when:
- Regulatory requirements demand auditable, real-world evidence
- The underlying distribution is unknown or highly complex and you cannot validate the synthetic data's fidelity
- Downstream decisions have high stakes and the synthetic data has not been rigorously evaluated against held-out real data
4.2 Python Libraries for Synthetic Data
Faker
- Source: `pip install faker`; documentation at faker.readthedocs.io
- Description: Generates fake but realistic personal data — names, addresses, phone numbers, emails, dates, job titles, credit card numbers, text paragraphs, and more. Supports localization for 60+ languages. The simplest tool for populating databases, testing pipelines, and creating demonstration datasets. NK uses this in her early data strategy exercises at Athena Retail Group to prototype customer databases without accessing production data.
- Use Case: Creating realistic test data for pipeline development, database schema testing, and UI prototyping
- Example:

```python
from faker import Faker
import pandas as pd

fake = Faker()
Faker.seed(42)  # Reproducibility

customers = pd.DataFrame({
    'customer_id': range(1, 1001),
    'name': [fake.name() for _ in range(1000)],
    'email': [fake.email() for _ in range(1000)],
    'city': [fake.city() for _ in range(1000)],
    'signup_date': [fake.date_between(start_date='-3y') for _ in range(1000)],
    'lifetime_value': [round(fake.random.uniform(10, 5000), 2) for _ in range(1000)]
})
```
- Relevant Chapters: Ch. 3 (Python basics), Ch. 4 (data strategy — prototype datasets), Ch. 12 (MLOps — test data)
SDV (Synthetic Data Vault)
- Source: `pip install sdv`; documentation at docs.sdv.dev
- Description: A suite of machine learning models that learn the statistical properties of your real data and generate new synthetic rows that preserve those properties. Supports single tables (using Gaussian copulas or CTGAN), multi-table relational databases, and time series. The most sophisticated open-source synthetic data library available.
- Use Case: Generating privacy-safe replicas of sensitive datasets (healthcare, finance) while preserving column correlations, distributions, and constraints
- Key Models:
  - `GaussianCopulaSynthesizer` — Fast, parametric; best for well-behaved numerical data
  - `CTGANSynthesizer` — GAN-based; handles mixed data types and complex distributions
  - `TVAESynthesizer` — VAE-based; often faster convergence than CTGAN
- Example:
```python
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

# real_data: a pandas DataFrame holding the sensitive source table
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=5000)
```
- Relevant Chapters: Ch. 7 (classification — augmenting rare classes), Ch. 25 (bias — counterfactual generation), Ch. 29 (privacy — de-identification alternatives)
scikit-learn Synthetic Data Generators
- Source: Built into scikit-learn (`sklearn.datasets`)
- Description: Functions for generating synthetic classification, regression, and clustering datasets with controlled properties. Useful when you need data with specific characteristics (number of informative features, noise level, class balance) for teaching or experimentation.
- Key Functions:
  - `make_classification()` — Binary or multi-class classification with controllable separability
  - `make_regression()` — Linear regression targets with specified noise
  - `make_blobs()` — Gaussian clusters for clustering exercises
  - `make_moons()` / `make_circles()` — Non-linearly separable classes for demonstrating kernel methods and neural networks
- Example:
```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    weights=[0.95, 0.05],  # Imbalanced classes
    random_state=42
)
```
- Relevant Chapters: Ch. 7 (classification — controlled experiments), Ch. 9 (clustering — `make_blobs`), Ch. 11 (model evaluation — learning curves), Ch. 13 (neural networks — `make_moons`)
NumPy / SciPy Statistical Distributions
- Source: Built into NumPy (`numpy.random`) and SciPy (`scipy.stats`)
- Description: Generate random samples from any standard probability distribution — normal, uniform, exponential, Poisson, beta, gamma, binomial, and dozens more. Essential for Monte Carlo simulations, bootstrapping, and creating controlled experimental datasets.
- Use Case: Simulating business scenarios (demand variability, arrival rates, defect probabilities), creating noise for model robustness testing
- Relevant Chapters: Ch. 8 (regression — simulated data), Ch. 16 (time series — synthetic series), Ch. 34 (ROI — Monte Carlo simulation)
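A few lines of NumPy are enough to turn a distributional assumption into a Monte Carlo estimate. The scenario below is hypothetical, chosen only to illustrate the pattern: a store whose daily demand is Poisson-distributed asks how often a fixed stock level covers it.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility

# Hypothetical scenario: daily demand ~ Poisson(mean=40); 50 units stocked each morning.
# Estimate the service level (probability demand is fully met) by simulation.
n_days = 100_000
demand = rng.poisson(lam=40, size=n_days)
stock = 50

service_level = (demand <= stock).mean()
print(f"Estimated P(demand fully met): {service_level:.3f}")
```

Swapping the distribution (e.g., `rng.normal` for forecast error, `rng.exponential` for inter-arrival times) changes the business question without changing the simulation pattern.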
4.3 Commercial and Cloud Synthetic Data Platforms
Gretel.ai
- Source: gretel.ai; Python SDK available via `pip install gretel-client`
- Description: A cloud platform for generating synthetic data using deep learning models. Offers pre-built APIs for tabular data synthesis, text generation, and data transformation. Includes built-in privacy evaluation (SQS — Synthetic Quality Score) that measures fidelity and privacy simultaneously. The free tier supports small datasets and limited API calls.
- Features: Tabular synthesis (LSTM, ACTGAN), text synthesis, data de-identification, privacy metrics
- Access: Free developer tier; paid plans for larger datasets and production use
- Relevant Chapters: Ch. 29 (privacy), Ch. 12 (MLOps — data pipeline testing)
MOSTLY AI
- Source: mostly.ai; cloud platform with free tier
- Description: Enterprise-focused synthetic data platform specializing in privacy-safe data generation for tabular and sequential data. Generates synthetic data that maintains correlations, distributions, and temporal patterns from the original while providing formal privacy guarantees. Includes quality reports comparing synthetic and original data distributions.
- Features: Smart imputation, rare category handling, time series, linked tables
- Access: Free tier (100K rows/day); paid enterprise plans
- Relevant Chapters: Ch. 29 (privacy), Ch. 4 (data strategy)
Tonic.ai
- Source: tonic.ai
- Description: Focused on creating realistic, de-identified replicas of production databases for development and testing. Preserves referential integrity across tables, maintains data format constraints, and supports subset generation. Designed for engineering and data teams that need production-like data in non-production environments.
- Features: Subsetting, referential integrity, format-preserving transformations, consistency
- Access: Free community edition; paid plans for enterprise
- Relevant Chapters: Ch. 12 (MLOps — staging environments), Ch. 29 (privacy)
Hazy
- Source: hazy.com
- Description: Enterprise synthetic data platform with a focus on regulated industries (financial services, healthcare, insurance). Uses differentially private generative models to produce synthetic data with formal privacy guarantees. Includes data quality reports and compliance documentation.
- Features: Differential privacy, compliance reporting, sequential data, multi-table support
- Access: Enterprise pricing; demo available
- Relevant Chapters: Ch. 29 (privacy — differential privacy), Ch. 28 (AI regulation — compliance)
Synthesis AI
- Source: synthesis.ai
- Description: Specializes in synthetic image and video data for computer vision. Generates photorealistic faces, people, and environments with pixel-perfect labels (segmentation masks, depth maps, landmarks). Useful when collecting and labeling real images is expensive or raises privacy concerns (e.g., facial recognition training).
- Features: Photorealistic rendering, automatic labeling, diversity controls, scene customization
- Access: Paid; custom pricing
- Relevant Chapters: Ch. 15 (computer vision — training data), Ch. 25 (bias — diverse training sets)
5. Data for Specific Book Exercises
This section maps key exercises and projects from each part of the textbook to recommended datasets. Use this as a quick reference when you need data for a specific assignment.
Part 1: Foundations of AI for Business (Chapters 1–6)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 3: pandas data manipulation practice | FiveThirtyEight datasets, Faker-generated data | Small, clean CSVs for learning syntax |
| Ch. 4: Data quality assessment exercise | Online Retail Dataset (UCI) | Contains missing values, duplicates, and data type issues — ideal for data quality auditing |
| Ch. 5: `EDAReport` builder | Olist Brazilian E-Commerce Dataset | Multiple related tables for joining; rich mix of numeric and categorical features |
| Ch. 5: Visualization exercises | Kaggle — "World Happiness Report" | Clean, well-structured, and produces compelling visualizations |
| Ch. 6: Business case assessment | Walmart Sales Forecasting | Clear business objective; quantifiable outcomes for ROI estimation |
Part 2: Core Machine Learning for Business (Chapters 7–12)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 7: `ChurnClassifier` | Online Retail Dataset (UCI), Kaggle "Telco Customer Churn" | Telco Churn is a clean binary classification dataset with clear business interpretation |
| Ch. 7: Fraud detection exercise | Credit Card Fraud Detection (Kaggle) | Class imbalance; precision-recall tradeoffs |
| Ch. 8: `DemandForecaster` | Walmart Sales Forecasting, Rossmann Store Sales | Multi-store, multi-feature regression with calendar effects |
| Ch. 8: Price prediction exercise | Kaggle "House Prices" | Classic regression competition; rich feature engineering opportunities |
| Ch. 9: `CustomerSegmenter` | Online Retail Dataset (UCI), Olist | RFM features for clustering; validate against known segments |
| Ch. 10: `RecommendationEngine` | Instacart Market Basket Analysis, H&M Fashion | Market basket and collaborative filtering exercises |
| Ch. 11: `ModelEvaluator` | German Credit Data (UCI) | Multiple model comparison with fairness constraints |
| Ch. 12: MLOps pipeline exercise | Faker-generated data, scikit-learn `make_classification` | Synthetic data for controlled pipeline testing |
Part 3: Deep Learning and Specialized AI (Chapters 13–18)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 13: Neural network from scratch | MNIST, Fashion-MNIST, CIFAR-10 | Start with MNIST; graduate to CIFAR-10 for CNNs |
| Ch. 14: `ReviewAnalyzer` | Amazon Product Reviews, IMDB Reviews, Yelp Open Dataset | Sentiment analysis at different scales |
| Ch. 14: Text classification exercise | AG News, 20 Newsgroups | Multi-class text classification |
| Ch. 15: Image classification project | Chest X-Ray (Pneumonia), Stanford Cars, Oxford Pets | Transfer learning with pre-trained ImageNet models |
| Ch. 15: Object detection exercise | COCO | Standard bounding box and segmentation tasks |
| Ch. 16: Prophet forecasting | Yahoo Finance (`yfinance`), FRED | Financial time series with trend, seasonality, and holidays |
| Ch. 17: LLM integration | OpenAI API + SEC EDGAR filings | Document summarization and information extraction |
| Ch. 18: Multimodal exercise | H&M Fashion (images + metadata), COCO Captions | Combining image and text features |
Part 4: Prompt Engineering and AI Tools (Chapters 19–24)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 19: `PromptBuilder` exercises | Any text dataset; Faker for generating test inputs | Focus is on prompt design, not data complexity |
| Ch. 20: `PromptChain` exercises | SEC EDGAR filings, NewsAPI | Multi-step reasoning over real documents |
| Ch. 21: RAG pipeline | Wikipedia dumps (subset), PubMed abstracts | Knowledge base for retrieval-augmented generation |
| Ch. 22: No-code/low-code AI projects | Kaggle "Titanic" (AutoML comparison) | Compare no-code tool results against manual approaches |
| Ch. 23: API integration lab | OpenAI API, Alpha Vantage, OpenWeatherMap | Practice API authentication, error handling, rate limiting |
| Ch. 24: Marketing AI project | Yelp Reviews, Reddit API, Olist reviews | Customer sentiment analysis and social listening |
Part 5: AI Ethics, Bias, and Governance (Chapters 25–30)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 25: `BiasDetector` | German Credit Data (UCI), COMPAS recidivism data (ProPublica), CelebA | Datasets with known demographic disparities |
| Ch. 26: `ExplainabilityDashboard` | Lending Club loan data, Heart Disease Dataset | Models with high-stakes decisions requiring explanation |
| Ch. 27: Governance framework exercise | No specific dataset needed | Framework and checklist exercise |
| Ch. 28: Regulation analysis | EU AI Act text, NIST AI RMF documentation | Policy documents, not data |
| Ch. 29: Privacy exercise | MIMIC-III, Faker-generated PII | De-identification and privacy-preserving analytics |
| Ch. 30: Responsible AI audit | Any Ch. 7–9 model output + SDV synthetic data | Full audit pipeline: bias check, explainability, privacy |
Part 6: AI Strategy and Organizational Transformation (Chapters 31–36)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 34: `AIROICalculator` | Walmart Sales (before/after modeling), FRED economic data | Quantifying business impact of AI interventions |
| Ch. 36: Industry deep dive | Industry-specific datasets from Section 2 above | Select based on your chosen industry vertical |
Part 7 and Capstone (Chapters 37–40)
| Exercise / Project | Recommended Dataset(s) | Notes |
|---|---|---|
| Ch. 39: `AIMaturityAssessment` + `TransformationRoadmapGenerator` | Faker for simulated organizational data; real company data if available | The capstone AI Transformation Plan benefits from combining real industry data with synthetic organizational metrics |
| Ch. 39: Capstone dataset selection | Use Google Dataset Search + this appendix | Choose datasets aligned with your transformation domain (retail, finance, healthcare, manufacturing) |
Quick Reference: Dataset Selection Checklist
Before committing to a dataset for your project, run through this checklist:
- Relevance — Does the dataset align with your business question? A technically interesting dataset that does not map to a real business problem will produce a weak project.
- Size — Is the dataset large enough for your chosen method? Deep learning generally needs thousands of examples; classical ML can work with hundreds. Is it small enough to work with given your compute resources?
- Quality — How much preprocessing is required? Some datasets are analysis-ready; others require significant cleaning. Factor this time into your project plan.
- Documentation — Is there a data dictionary, codebook, or schema description? Poorly documented data leads to incorrect assumptions.
- Recency — When was the data collected? A dataset from 2010 may not reflect current patterns, especially in fast-moving domains like social media or e-commerce.
- License — Can you legally use the data for your intended purpose? Academic use, commercial use, and redistribution rights vary by dataset. Check before you build.
- Ethical considerations — Does the data contain sensitive attributes (race, gender, health status)? If so, have appropriate protections been applied? Could your analysis cause harm to the populations represented? See Chapters 25–30 for frameworks.
- Reproducibility — Can someone else access the same data to verify your results? APIs with changing data are harder to reproduce from than static CSV downloads. Consider saving a snapshot.
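One low-effort way to act on the snapshot advice is to store the raw download alongside a checksum, so a later run can verify it is analyzing identical data. A sketch using only the standard library; `save_snapshot` and the file names are our own illustrative choices:

```python
import hashlib
import tempfile
from pathlib import Path

def save_snapshot(csv_text: str, path: Path) -> str:
    """Write a point-in-time copy of downloaded data plus its SHA-256 checksum."""
    data = csv_text.encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    path.write_bytes(data)
    Path(f"{path}.sha256").write_text(digest)  # companion checksum file
    return digest

# Example: snapshot a (tiny, fabricated) API response into a temp directory
snapshot_dir = Path(tempfile.mkdtemp())
digest = save_snapshot("id,value\n1,10\n2,20\n", snapshot_dir / "api_snapshot.csv")
```

Committing the checksum (not necessarily the data itself, which may be large or license-restricted) to version control documents exactly which snapshot your results depend on.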
A Final Note on Data Ethics
As Prof. Okonkwo reminds us throughout this textbook, data is never neutral. Every dataset reflects the choices of the people who collected it — what they chose to measure, whom they chose to include, and how they chose to categorize the world. The datasets listed in this appendix are powerful tools for learning, but they are not free from bias, omission, or error.
When working with any dataset, ask yourself:
- Who collected this data, and why?
- Who is represented in this data, and who is missing?
- What assumptions are embedded in the category definitions and labels?
- What could go wrong if a model trained on this data were deployed in the real world?
These questions are not obstacles to your analysis — they are essential components of it. The most technically sophisticated model built on unexamined data is, as Tom learns at Athena Retail Group, a liability disguised as an asset.
"Data does not speak for itself. It speaks through the lens of whoever collected it, cleaned it, and chose to analyze it. Your job is to understand that lens." — Prof. Diane Okonkwo, Chapter 4 closing lecture