Appendix E: Dataset Catalog

A curated collection of real-world datasets for practicing the techniques covered in this book. Every dataset listed here is freely available, well-documented, and maps to specific chapters. This is not an exhaustive catalog — it is an opinionated selection of datasets that are actually worth your time.

How to Use This Catalog

Each entry includes:

  • Source: Where to download it
  • Size: Approximate row/feature counts and file size
  • Description: What the data represents
  • Good for: Which chapters and techniques it supports
  • Watch out: Known quirks, cleaning challenges, or ethical considerations

Start with datasets in the domain closest to your professional experience. Domain knowledge matters more than algorithmic sophistication — that is one of this book's central arguments.


Healthcare

MIMIC-III Clinical Database

  • Source: PhysioNet (https://physionet.org/content/mimiciii/1.4/) — requires credentialed access and ethics training
  • Size: ~58,000 hospital admissions, 40+ tables, ~6 GB compressed
  • Description: De-identified health data from ICU patients at Beth Israel Deaconess Medical Center, 2001-2012. Includes demographics, lab results, medications, ICD-9 codes, nursing notes, vital signs, and mortality outcomes.
  • Good for: Chapters 5 (SQL joins across relational tables), 7 (categorical encoding of diagnosis codes), 8 (extensive missing data), 16-17 (class imbalance in mortality prediction), 19 (model interpretation for clinical decisions), 33 (fairness across demographic groups)
  • Watch out: Requires CITI training and a credentialed PhysioNet account. Do not attempt to re-identify patients. The SQL complexity is real — this is production-grade relational data, not a flat CSV.

Heart Disease Dataset (UCI)

  • Source: UCI ML Repository (https://archive.ics.uci.edu/dataset/45/heart+disease)
  • Size: 303 rows, 14 features, ~12 KB
  • Description: Patient records from Cleveland Clinic with demographic, clinical, and test features predicting the presence of heart disease.
  • Good for: Chapters 11 (logistic regression baseline), 13 (tree-based methods), 14 (gradient boosting), 16 (evaluation metrics on a small dataset), 18 (hyperparameter tuning)
  • Watch out: Small enough that cross-validation variance will be high. Good for learning, but too small to support production conclusions.

Diabetes 130-US Hospitals Dataset

  • Source: UCI ML Repository (https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008)
  • Size: ~100,000 rows, 50+ features, ~20 MB
  • Description: Ten years of clinical care data from 130 US hospitals. Each row represents a hospital admission for a diabetic patient, including demographics, diagnoses, medications, and whether the patient was readmitted within 30 days.
  • Good for: Chapters 7 (heavy categorical encoding), 8 (missing data patterns), 9 (feature selection from 50+ candidates), 16 (multiclass target — readmission within 30 days, after 30 days, or not at all), 17 (class imbalance), 33 (fairness audit on race and age)
  • Watch out: Closest real-world analog to the Metro General anchor example. The readmission target has three classes; most tutorials collapse it to binary, which changes the problem.
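The two common ways of collapsing the target can be sketched directly. This is a toy illustration; the value labels ("<30", ">30", "NO") follow the UCI release, but verify them against your download:

```python
import pandas as pd

# The readmission target has three values. Collapsing to binary is a
# modeling decision, not a default: the two common collapses answer
# different questions.
readmitted = pd.Series(["<30", "NO", ">30", "NO", "<30"])

# Question 1: readmitted within 30 days? (the Metro General framing)
y_30day = (readmitted == "<30").astype(int)

# Question 2: readmitted at all, ever?
y_any = (readmitted != "NO").astype(int)

print(y_30day.tolist(), y_any.tolist())  # [1, 0, 0, 0, 1] [1, 0, 1, 0, 1]
```

The ">30" rows flip between negative and positive depending on which question you ask, which is why the two binary problems have different base rates and different clinical meaning.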

CDC Behavioral Risk Factor Surveillance System (BRFSS)

  • Source: CDC (https://www.cdc.gov/brfss/annual_data/annual_data.htm)
  • Size: ~400,000+ rows per year, 300+ features, ~150 MB per year
  • Description: Annual telephone survey of US residents covering health-related risk behaviors, chronic conditions, and use of preventive services.
  • Good for: Chapters 5 (SQL-style filtering and aggregation), 6 (feature engineering from survey data), 9 (feature selection from hundreds of variables), 21 (dimensionality reduction), 28 (large dataset handling)
  • Watch out: Survey weights matter. Naive analysis without applying weights produces biased estimates. Complex survey design requires careful handling.
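The weighting point is easy to see on a toy example. The column names below are placeholders (in recent BRFSS releases the final weight column is typically "_LLCPWT" — confirm in the codebook for your year):

```python
import numpy as np
import pandas as pd

# Five respondents: two report a chronic condition, but those two carry
# small survey weights (they were oversampled relative to the population).
df = pd.DataFrame({
    "has_condition": [1, 0, 0, 1, 0],
    "weight":        [50, 400, 350, 60, 300],
})

naive = df["has_condition"].mean()                          # ignores the design
weighted = np.average(df["has_condition"], weights=df["weight"])

print(f"naive prevalence:    {naive:.3f}")     # 0.400
print(f"weighted prevalence: {weighted:.3f}")  # 0.095
```

The naive estimate is off by a factor of four here; on real BRFSS variables the bias is smaller but systematic, which is exactly why unweighted analysis produces misleading population estimates.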

E-Commerce and Retail

Instacart Market Basket Analysis

  • Source: Kaggle (https://www.kaggle.com/c/instacart-market-basket-analysis)
  • Size: 3.4 million orders, 200,000 users, 50,000 products, ~1.3 GB
  • Description: Anonymized data on customer orders over time, including product names, departments, aisle information, and order sequences.
  • Good for: Chapters 5 (SQL for order aggregation), 6 (feature engineering — days since last order, reorder rate), 23 (association rules — market basket analysis), 24 (recommender systems), 25 (time series of order patterns)
  • Watch out: The original competition is closed, but the dataset remains available. Feature engineering is where the real work happens here — raw data is transactional.
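A sketch of what that feature engineering looks like on the orders table. The column names ("user_id", "order_number", "days_since_prior_order") follow the published schema, but check them against your download:

```python
import pandas as pd

# Toy slice of the orders table: one row per order, in sequence per user.
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [None, 7.0, 10.0, None, 30.0],
})

# Roll transactional rows up into one feature row per user.
user_feats = orders.groupby("user_id").agg(
    n_orders=("order_number", "max"),
    mean_gap_days=("days_since_prior_order", "mean"),  # NaN on first order is skipped
)
print(user_feats)
```

Almost every useful feature here follows this pattern: group the raw transactions by user (or by user-product pair), then aggregate.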

Online Retail II Dataset (UCI)

  • Source: UCI ML Repository (https://archive.ics.uci.edu/dataset/502/online+retail+ii)
  • Size: ~1 million transactions, 8 features, ~45 MB
  • Description: All transactions from a UK-based online retailer between 2009 and 2011. Includes invoice number, stock code, description, quantity, price, customer ID, and country.
  • Good for: Chapters 6 (RFM feature engineering), 8 (missing CustomerID values), 20 (customer segmentation via clustering), 22 (anomaly detection — fraudulent returns, bulk orders), 25 (time series of sales)
  • Watch out: Contains negative quantities (returns) and zero-priced items (adjustments). Cleaning these is the first real task.
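A first-pass cleaning sketch. The column names ("Invoice", "Quantity", "Price") match the Online Retail II release; the earlier version uses "InvoiceNo" and "UnitPrice", so check your file's header:

```python
import pandas as pd

# Toy rows showing the two quirks: a return (negative quantity, invoice
# prefixed with "C") and a zero-priced manual adjustment.
df = pd.DataFrame({
    "Invoice": ["536365", "C536379", "536380"],
    "Quantity": [6, -1, 12],
    "Price": [2.55, 27.50, 0.0],
})

returns = df[df["Quantity"] < 0]                      # keep separately: useful for churn/fraud work
sales = df[(df["Quantity"] > 0) & (df["Price"] > 0)]  # clean sales only
print(len(sales), "clean sales rows,", len(returns), "returns")
```

Whether you drop the returns or model them is a real decision — for anomaly detection (Chapter 22) the returns are the interesting part.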

Olist Brazilian E-Commerce Dataset

  • Source: Kaggle (https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)
  • Size: ~100,000 orders across 9 relational tables, ~45 MB
  • Description: Real commercial data from Olist, a Brazilian e-commerce marketplace. Includes order status, pricing, payment, shipping performance, customer location, product attributes, and seller reviews.
  • Good for: Chapters 5 (multi-table SQL joins), 6 (feature engineering across relational tables), 10 (building a reproducible pipeline from raw to features), 26 (NLP on Portuguese-language reviews), 27 (geospatial analysis of delivery times by region)
  • Watch out: Review text is in Portuguese. Good opportunity to practice multilingual NLP or to focus on the structured data and ignore text.

Manufacturing and IoT

NASA Turbofan Engine Degradation Simulation (C-MAPSS)

  • Source: NASA Prognostics Data Repository (https://data.nasa.gov/dataset/C-MAPSS-Aircraft-Engine-Simulator-Data/)
  • Size: ~60,000 rows, 26 sensor readings + 3 operational settings, ~12 MB
  • Description: Run-to-failure simulated data for turbofan engines under different operating conditions and fault modes. Each engine starts healthy and develops a fault over time.
  • Good for: Chapters 6 (engineering features from sensor streams — rolling means, rate of change), 14 (gradient boosting for remaining useful life), 22 (anomaly detection on degradation curves), 25 (time series forecasting), 28 (multivariate time series)
  • Watch out: Closest analog to the TurbineTech anchor example. The target is remaining useful life (RUL), a regression target, but many tutorials clip it at 125 cycles — understand why before you do.
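The RUL labeling and the 125-cycle cap can be sketched in a few lines. The C-MAPSS files ship without headers; "unit" and "cycle" are the conventional names for the first two columns:

```python
import pandas as pd

# Toy slice: two engines observed for 3 and 2 cycles respectively.
df = pd.DataFrame({
    "unit":  [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
})

# Remaining useful life: cycles left before this unit's last observed cycle.
df["RUL"] = df.groupby("unit")["cycle"].transform("max") - df["cycle"]

# Common practice: cap RUL at 125 so the model is not asked to distinguish
# among "very healthy" states where the sensors carry no degradation signal yet.
df["RUL_clipped"] = df["RUL"].clip(upper=125)
print(df)
```

The cap matters because early in an engine's life the sensor readings are flat — an uncapped target asks the model to predict 300 vs. 280 cycles from identical inputs, which just adds noise to the loss.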

Steel Plates Faults Dataset (UCI)

  • Source: UCI ML Repository (https://archive.ics.uci.edu/dataset/198/steel+plates+faults)
  • Size: 1,941 rows, 27 features, 7 fault classes, ~200 KB
  • Description: Features extracted from images of steel plate surfaces to classify types of manufacturing defects.
  • Good for: Chapters 13 (multiclass tree-based classification), 15 (k-NN on manufacturing data), 16 (multiclass evaluation — macro vs. micro averaging), 17 (class imbalance across fault types)
  • Watch out: Small dataset. Some fault classes have very few samples, making stratified splits essential.

Microsoft Azure Predictive Maintenance Dataset
  • Source: Kaggle (https://www.kaggle.com/datasets/arnabbiswas1/microsoft-azure-predictive-maintenance)
  • Size: ~10,000 rows across 5 tables (telemetry, errors, maintenance, failures, machines), ~50 MB
  • Description: Synthetic but realistic data simulating machine telemetry, error logs, maintenance records, and component failure events over a year.
  • Good for: Chapters 5 (SQL joins across telemetry and event tables), 6 (time-windowed feature engineering), 8 (missing data in telemetry streams), 17 (severe class imbalance — failures are rare), 25 (time series of sensor readings)
  • Watch out: Synthetic data. Good for learning the workflow, but the patterns are cleaner than real manufacturing data.
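The time-windowed feature engineering looks like this. Column names ("machineID", "datetime", "volt") follow the published telemetry schema, but verify against your download:

```python
import pandas as pd

# Toy telemetry: hourly voltage readings for one machine.
telemetry = pd.DataFrame({
    "machineID": [1, 1, 1, 1],
    "datetime": pd.date_range("2015-01-01", periods=4, freq="h"),
    "volt": [170.0, 172.0, 168.0, 171.0],
})

# 3-hour rolling mean of voltage, computed per machine so the window
# never bleeds across machine boundaries.
telemetry["volt_3h_mean"] = (
    telemetry.groupby("machineID")["volt"]
    .transform(lambda s: s.rolling(3, min_periods=1).mean())
)
print(telemetry)
```

The per-machine groupby is the part people forget: a plain `.rolling()` over the concatenated table silently mixes the last readings of machine N into the first windows of machine N+1.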

Finance

Credit Card Fraud Detection

  • Source: Kaggle (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
  • Size: 284,807 transactions, 30 PCA-transformed features, ~150 MB
  • Description: Transactions made by European cardholders in September 2013. Features are PCA-transformed for privacy. Contains 492 frauds out of 284,807 transactions (0.172% positive rate).
  • Good for: Chapters 17 (extreme class imbalance — the textbook example), 16 (precision-recall curves over ROC), 22 (anomaly detection framing), 14 (gradient boosting on imbalanced data)
  • Watch out: Features are already PCA-transformed, so feature engineering and interpretation are limited. The imbalance ratio is the real learning opportunity.
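Why precision-recall beats ROC at a 0.17% positive rate can be demonstrated on synthetic scores (a sketch, not the real dataset): with this few positives, ROC AUC can look strong while average precision exposes how many false alarms each caught fraud costs.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[:17] = 1                              # ~0.17% positives, like the real dataset
scores = rng.normal(0.0, 1.0, 10_000)
scores[:17] += 2.0                      # frauds score higher on average, imperfectly

print("ROC AUC:", round(roc_auc_score(y, scores), 3))
print("Average precision:", round(average_precision_score(y, scores), 3))
```

The two numbers diverge sharply: ROC AUC is dominated by the 9,983 easy negatives, while average precision is dragged down by every legitimate transaction that outscores a fraud.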

Lending Club Loan Data

  • Source: Kaggle (various uploads; https://www.kaggle.com/datasets/wordsforthewise/lending-club)
  • Size: ~2.2 million loans, 150+ features, ~1.7 GB
  • Description: Loan-level data including borrower demographics, credit history, loan characteristics, and repayment outcomes for loans issued through the Lending Club platform.
  • Good for: Chapters 6 (feature engineering from financial history), 7 (categorical encoding of employment title, state, purpose), 9 (feature selection from 150+ columns), 11 (logistic regression for credit scoring), 14 (gradient boosting), 19 (SHAP for loan denial explanations), 28 (large dataset techniques), 33 (fairness in lending decisions)
  • Watch out: Contains demographic features (zip code, state) that proxy for race. Fairness analysis is not optional here — it is the point.

S&P 500 Stock Data

  • Source: Yahoo Finance API via yfinance Python package, or Kaggle historical datasets
  • Size: Varies — daily OHLCV data for 500+ stocks over 20+ years
  • Description: Daily open, high, low, close prices and trading volume for S&P 500 constituent stocks.
  • Good for: Chapters 25 (time series analysis and forecasting), 6 (feature engineering — returns, volatility, moving averages), 20 (clustering stocks by behavior)
  • Watch out: Financial time series are notoriously hard to predict. If your model looks too good, you have data leakage. Use this to learn the techniques, not to build a trading strategy.

NLP and Text

IMDB Movie Reviews

  • Source: Stanford AI Lab (https://ai.stanford.edu/~amaas/data/sentiment/) or tensorflow_datasets
  • Size: 50,000 reviews (25K train, 25K test), ~65 MB
  • Description: Movie reviews labeled as positive or negative sentiment. A clean binary classification benchmark.
  • Good for: Chapters 26 (NLP fundamentals — bag of words, TF-IDF, tokenization), 14 (gradient boosting on text features), 12 (SVM on text — historically strong baseline)
  • Watch out: Balanced classes and clean text make this easier than real-world NLP tasks. Good for learning the pipeline, not representative of production complexity.

20 Newsgroups

  • Source: scikit-learn (sklearn.datasets.fetch_20newsgroups)
  • Size: ~18,000 documents across 20 categories, ~14 MB
  • Description: Posts from 20 different Usenet newsgroups, covering topics from politics to sports to computer hardware.
  • Good for: Chapters 26 (multiclass text classification, topic modeling), 21 (dimensionality reduction on text — LSA/truncated SVD), 15 (Naive Bayes — the classic text classification baseline)
  • Watch out: Some categories overlap (e.g., comp.sys.ibm.pc.hardware and comp.sys.mac.hardware). Headers and footers can leak category information; remove them.
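The classic baseline pipeline, sketched on a toy corpus so it runs standalone. With the real data, fetch it via `sklearn.datasets.fetch_20newsgroups` and pass `remove=("headers", "footers", "quotes")` to strip the leaking metadata:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Four tiny documents standing in for two newsgroup categories.
docs = [
    "the gpu driver crashed on my pc hardware",
    "new video card and motherboard for sale",
    "the pitcher threw a perfect game last night",
    "our team won the baseball series",
]
labels = ["comp", "comp", "sport", "sport"]

# TF-IDF features into a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["which motherboard fits this video card"]))  # ['comp']
```

The same two-line pipeline, fit on the full 18,000 documents, is the baseline any fancier model must beat.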

Amazon Product Reviews

  • Source: Various Kaggle uploads; also available through Hugging Face datasets
  • Size: Varies by subset — typically 500K-5M reviews
  • Description: Product reviews with star ratings, review text, product category, and helpfulness votes.
  • Good for: Chapters 26 (sentiment analysis, review summarization), 23 (association rules — frequently co-purchased products), 24 (recommender systems from review ratings), 6 (feature engineering from text — review length, sentiment score, helpfulness ratio)
  • Watch out: Star ratings are heavily skewed toward 4-5 stars. If treating as binary sentiment, the cutoff you choose matters.
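How much the cutoff matters is easy to quantify. A sketch on toy ratings (the column name "stars" is a placeholder; check your download's schema):

```python
import pandas as pd

# A typical 4-5-heavy distribution of star ratings.
stars = pd.Series([5, 5, 4, 5, 4, 3, 5, 4, 1, 5])

pos_if_ge4 = (stars >= 4).mean()   # 3-star reviews counted as negative
pos_if_ge3 = (stars >= 3).mean()   # 3-star reviews counted as positive
print(pos_if_ge4, pos_if_ge3)      # 0.8 0.9 -- a different base rate, a different problem
```

A third common choice is to drop 3-star reviews entirely as "neutral"; whichever you pick, state it explicitly, because results are not comparable across cutoffs.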

General ML Benchmarks

California Housing

  • Source: scikit-learn (sklearn.datasets.fetch_california_housing)
  • Size: 20,640 rows, 8 features
  • Description: Median house values for California census block groups, based on 1990 census data. Features include median income, average rooms, average occupancy, latitude, longitude.
  • Good for: Chapters 11 (linear regression baseline), 13 (tree-based regression), 14 (gradient boosting regression), 18 (hyperparameter tuning), 19 (SHAP for regression), 27 (geospatial analysis — latitude/longitude)
  • Watch out: Old data (1990). Good for learning regression techniques, not for making current real estate predictions.

Adult Income (Census Income)

  • Source: UCI ML Repository (https://archive.ics.uci.edu/dataset/2/adult)
  • Size: 48,842 rows, 14 features
  • Description: Predict whether income exceeds $50K/year based on census data. Features include age, education, occupation, race, sex, hours-per-week.
  • Good for: Chapters 7 (categorical encoding — workclass, education, occupation, relationship), 11 (logistic regression), 13 (tree-based classification), 33 (fairness analysis — the standard benchmark for algorithmic fairness research)
  • Watch out: This dataset is the default fairness benchmark for a reason. Race and sex are features. If you build a model without a fairness audit, you are doing it wrong.
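A minimal audit is a few lines of pandas. Toy data below; with Adult, the group column would be "sex" or "race" and the predictions would come from your trained model:

```python
import pandas as pd

df = pd.DataFrame({
    "sex":  ["M", "M", "M", "F", "F", "F"],
    "pred": [1,    1,   0,   1,   0,   0],   # model's >50K predictions
})

# Selection rate: fraction of each group the model predicts positive.
selection_rates = df.groupby("sex")["pred"].mean()
print(selection_rates)

# Demographic-parity ratio: values far below 1.0 flag a disparity worth
# investigating (the "four-fifths rule" uses 0.8 as a common threshold).
print("parity ratio:", selection_rates.min() / selection_rates.max())
```

This is only the first check — equalized odds and calibration by group (Chapter 33) ask sharper questions — but if you have not even computed selection rates per group, you have not audited the model.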

Titanic

  • Source: Kaggle (https://www.kaggle.com/c/titanic)
  • Size: 891 rows (train), 12 features
  • Description: Passenger survival prediction on the RMS Titanic. Features include class, name, sex, age, siblings/spouses, parents/children, ticket, fare, cabin, port of embarkation.
  • Good for: Chapters 7 (categorical encoding), 8 (missing data — Age and Cabin have substantial missingness), 13 (decision trees for interpretable classification)
  • Watch out: Overused. Every tutorial uses this dataset. It is fine for the first three hours of learning; after that, move on.

Ames Housing

  • Source: Kaggle (https://www.kaggle.com/c/house-prices-advanced-regression-techniques)
  • Size: 1,460 rows (train), 79 features
  • Description: Residential housing prices in Ames, Iowa, with detailed property characteristics including lot size, building quality, number of rooms, garage type, and sale condition.
  • Good for: Chapters 6 (extensive feature engineering), 7 (ordinal and nominal categoricals), 8 (missing data with clear patterns — "no garage" vs. truly missing), 9 (feature selection from 79 candidates), 11 (linear regression with feature engineering), 18 (hyperparameter tuning)
  • Watch out: Much richer than California Housing for feature engineering practice. The 79 features force you to make real feature selection decisions.
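The two kinds of missingness deserve different treatment, sketched here on two well-known Ames columns: in GarageType, NaN means "no garage" (a real category), while in LotFrontage, NaN is genuinely unknown and needs imputation.

```python
import pandas as pd

df = pd.DataFrame({
    "GarageType":  ["Attchd", None, "Detchd", None],
    "LotFrontage": [65.0, 80.0, None, 70.0],
})

# Informative missingness: NaN IS the answer, so encode it as a category.
df["GarageType"] = df["GarageType"].fillna("NoGarage")

# Genuinely unknown value: impute (median here; Chapter 8 covers better options).
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
print(df)
```

Median-imputing GarageType's missingness would have invented garages for houses that have none — collapsing the two cases is the most common mistake on this dataset.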

Government and Public Data Portals

These are not single datasets but portals with hundreds of datasets across domains:

  • data.gov (https://data.gov): US federal government open data. Thousands of datasets covering health, education, transportation, energy, and more.
  • data.gov.uk (https://data.gov.uk): UK government open data portal.
  • Eurostat (https://ec.europa.eu/eurostat/data/database): Statistical office of the European Union. Economic, demographic, and social data.
  • World Bank Open Data (https://data.worldbank.org): Development indicators for every country. Good for geospatial and time series practice.
  • data.census.gov (https://data.census.gov): US Census Bureau. Demographic and economic data at multiple geographic levels.

These portals are where you will find datasets for your capstone project (Chapter 35) and for building a portfolio that demonstrates you can work with messy, real-world data — not just Kaggle competitions.


Choosing the Right Dataset

A decision framework:

  • Learn the ML workflow end-to-end: Heart Disease (small, clean, fast iteration)
  • Practice feature engineering: Ames Housing or Lending Club
  • Practice SQL: MIMIC-III or Olist (relational tables)
  • Practice class imbalance: Credit Card Fraud or Diabetes Readmission
  • Practice NLP: IMDB (binary), then 20 Newsgroups (multiclass)
  • Practice time series: C-MAPSS or S&P 500
  • Practice anomaly detection: Credit Card Fraud or C-MAPSS
  • Practice fairness analysis: Adult Income or Lending Club
  • Build a portfolio project: government portals + domain expertise
  • Mirror the textbook anchors: Diabetes Readmission (hospital), Olist (e-commerce), C-MAPSS (manufacturing)

The best dataset for learning is one you care about. Domain knowledge is a feature.