Appendix I: Key Papers and Reading Lists

An Annotated Guide to the Literature That Shaped Modern Data Science


This appendix provides an annotated reading list of seminal and influential papers organized by the seven parts of this book. For each paper, we include the full citation, a brief summary, a difficulty rating, and guidance on reading order. The goal is not completeness — the ML literature grows by thousands of papers per month — but to identify the papers that are most worth reading carefully, the papers that changed how the field thinks.

Difficulty ratings:

  - Introductory — Accessible to a reader who has completed the relevant chapters. Good first papers in a subfield.
  - Intermediate — Requires comfort with the mathematical foundations (Part I) and familiarity with the subfield. The bulk of the papers listed here.
  - Advanced — Assumes deep fluency with the material. Research-frontier work. Best read after completing the book.

How to use this appendix: Start with the papers marked "start here" in each section. Use the reading order suggestions to build depth progressively. Apply the three-pass reading strategy from Chapter 37 to every paper. Do not try to read all 100+ papers — select the subfields most relevant to your work and read deeply there.


Part I: Mathematical and Computational Foundations

Linear Algebra and Optimization

  1. Strang, G. (1993). "The Fundamental Theorem of Linear Algebra." The American Mathematical Monthly, 100(9), 848-855. Not a research paper but an expository article by the master of linear algebra pedagogy. Presents the four fundamental subspaces and their relationships in a way that illuminates every application in this book. Start here for Part I. (Introductory)

  2. Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions." SIAM Review, 53(2), 217-288. The definitive treatment of randomized SVD and related methods. Explains why randomized algorithms can compute SVDs faster than deterministic methods while maintaining accuracy guarantees. Essential for understanding scalable dimensionality reduction. (Intermediate)

  3. Ruder, S. (2016). "An Overview of Gradient Descent Optimization Algorithms." arXiv:1609.04747. The most widely cited survey of SGD variants (momentum, RMSProp, Adam, etc.). Clear presentation of each algorithm's motivation and behavior. Excellent companion to Chapter 2. (Introductory)

  4. Kingma, D. P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization." Proceedings of ICLR 2015. Introduces the Adam optimizer — adaptive learning rates with momentum. The default optimizer for most deep learning. Read alongside the AdamW correction (Loshchilov & Hutter, 2019). (Introductory)

  5. Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization." Proceedings of ICLR 2019. Shows that L2 regularization and weight decay are not equivalent for adaptive optimizers like Adam. Introduces AdamW, which is now the default for transformer training. A subtle but important distinction. (Intermediate)

Probability and Information Theory

  1. Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). "Variational Inference: A Review for Statisticians." Journal of the American Statistical Association, 112(518), 859-877. The clearest introduction to variational inference for readers with a statistics background. Derives the ELBO, explains mean-field approximation, and connects VI to EM and MCMC. Essential reading before Chapter 21. A one-line statement of the ELBO follows this list. (Intermediate)

  2. Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley. The textbook, not a paper, but Chapter 2 (Entropy, Relative Entropy, and Mutual Information) is the canonical reference for the information-theoretic concepts in Chapter 4. Worth reading as a standalone chapter. (Intermediate)


Part II: Deep Learning

Foundational Architectures

  1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning Representations by Back-propagating Errors." Nature, 323, 533-536. The paper that introduced backpropagation to the broad scientific community. Short, elegant, and historically essential. Read to understand where the field began. (Introductory)

  2. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). "Gradient-Based Learning Applied to Document Recognition." Proceedings of the IEEE, 86(11), 2278-2324. Introduces LeNet-5 and the modern CNN architecture. The entire pipeline (convolution, pooling, backpropagation through convolutional layers) is presented here. A long paper, but the figures alone are worth the read. (Introductory)

  3. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." Proceedings of CVPR 2016. Introduces residual connections (skip connections), enabling training of networks with 100+ layers. One of the most cited papers in deep learning. The key insight — identity mappings allow gradient flow — extends beyond computer vision. Start here for deep learning architecture papers. A minimal residual-block sketch follows this list. (Introductory)

  4. Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735-1780. Introduces the LSTM architecture. Dense and mathematical, but the core insight (additive gradient flow through the cell state) is beautifully simple. Historical context for Chapter 9. (Intermediate)

  5. Ioffe, S. & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." Proceedings of ICML 2015. Introduces batch normalization. The theoretical motivation (internal covariate shift) has been questioned, but the technique remains ubiquitous. Read alongside Santurkar et al. (2018), "How Does Batch Normalization Help Optimization?" which provides a more accurate explanation. (Introductory)

  6. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research, 15, 1929-1958. Introduces dropout regularization. The paper is notable for its clarity and the ensemble interpretation of dropout. (Introductory)
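
To accompany paper 3 (He et al.), a minimal PyTorch sketch of the identity-skip idea; the layer sizes and the post-activation layout are illustrative choices, not a reproduction of the paper's exact blocks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Minimal residual block: the output is F(x) + x, so the block only has to
    learn a residual and gradients can flow unchanged through the identity path."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity skip connection
```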

Transformers and Attention

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." Proceedings of NeurIPS 2017. The paper. Introduces the transformer architecture: multi-head self-attention, positional encoding, and the encoder-decoder structure. Required reading for every data scientist. Apply the three-pass strategy from Chapter 37. Start here for the transformer literature. A NumPy sketch of scaled dot-product attention follows this list. (Intermediate)

  2. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL 2019. Introduces BERT and the pretrain-then-fine-tune paradigm that dominates modern NLP. Masked language modeling as a pretraining objective. The paper that made transfer learning standard in NLP. (Intermediate)

  3. Brown, T. et al. (2020). "Language Models are Few-Shot Learners." Proceedings of NeurIPS 2020. The GPT-3 paper. Demonstrates that scaling language models to 175B parameters enables few-shot learning without fine-tuning. Introduces in-context learning. A turning point for the field. (Intermediate)

  4. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." Proceedings of NeurIPS 2022. Shows that attention can be computed 2-4x faster by exploiting GPU memory hierarchy (SRAM vs. HBM). Same mathematical result, dramatically better wall-clock time. The canonical example of why understanding hardware matters for ML. (Advanced)

  5. Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. Demonstrates that smaller, well-trained open models can match or exceed larger proprietary models. Revitalized open-source LLM research. Read alongside the Chinchilla paper (Hoffmann et al., 2022) for the scaling laws that motivated LLaMA's training recipe. (Intermediate)

  6. Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." Proceedings of NeurIPS 2022. The "Chinchilla" paper. Shows that most LLMs are significantly undertrained for their size — the compute-optimal strategy involves smaller models trained on more data. Changed how the field thinks about scaling. (Intermediate)
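
To accompany paper 1 (Vaswani et al.), a minimal single-head NumPy sketch of scaled dot-product attention; masking and the multi-head projections are omitted:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for a single head.

    Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns attended values and weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights
```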

Generative Models

  1. Kingma, D. P. & Welling, M. (2014). "Auto-Encoding Variational Bayes." Proceedings of ICLR 2014. Introduces the VAE. The reparameterization trick — a beautifully simple idea that enables gradient-based optimization of latent variable models. Read alongside the ELBO derivation in Chapter 12. Start here for generative models. (Intermediate)

  2. Goodfellow, I. et al. (2014). "Generative Adversarial Nets." Proceedings of NeurIPS 2014. Introduces GANs. The minimax formulation is elegant; the practical training challenges kept the field busy for years. Read the paper for the idea, then read Arjovsky et al. (2017) for why training is hard. (Intermediate)

  3. Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." Proceedings of NeurIPS 2020. The paper that made diffusion models practical. Introduces the simplified training objective (predict the noise) and the connection to score matching. The foundation for DALL-E 2, Stable Diffusion, and modern image generation. (Intermediate)

  4. Arjovsky, M., Chintala, S., & Bottou, L. (2017). "Wasserstein GAN." Proceedings of ICML 2017. Diagnoses the training instability of GANs (mode collapse, vanishing gradients) and proposes the Wasserstein distance as an alternative training objective. Elegant theoretical analysis. (Advanced)
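
For paper 1 (Kingma & Welling), the reparameterization trick in one line, using the standard Gaussian encoder notation:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The randomness sits in the fixed noise $\epsilon$, so gradients of the ELBO flow through $\mu_\phi$ and $\sigma_\phi$ by ordinary backpropagation.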

Graph Neural Networks

  1. Kipf, T. N. & Welling, M. (2017). "Semi-Supervised Classification with Graph Convolutional Networks." Proceedings of ICLR 2017. Introduces the GCN layer — a simplified spectral graph convolution that operates as neighborhood aggregation. The paper that launched the modern GNN era. Start here for graph neural networks. (Intermediate)

  2. Hamilton, W., Ying, Z., & Leskovec, J. (2017). "Inductive Representation Learning on Large Graphs." Proceedings of NeurIPS 2017. Introduces GraphSAGE — scalable GNNs via neighborhood sampling. Unlike GCN, GraphSAGE can generalize to unseen nodes, making it practical for production recommendation systems. (Intermediate)

  3. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). "Graph Attention Networks." Proceedings of ICLR 2018. Introduces attention mechanisms for graphs. Attention weights allow the model to learn which neighbors are most important — a natural extension of the transformer attention to non-Euclidean data. (Intermediate)

  4. Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). "How Powerful are Graph Neural Networks?" Proceedings of ICLR 2019. Characterizes the expressive power of GNNs via the Weisfeiler-Leman graph isomorphism hierarchy. Shows that standard message-passing GNNs are at most as powerful as the 1-WL test. Introduces the Graph Isomorphism Network (GIN). (Advanced)
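
To accompany paper 1 (Kipf & Welling), a minimal dense NumPy sketch of one GCN propagation step; real implementations use sparse matrices, and the ReLU nonlinearity is an illustrative choice:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: H' = relu(D^{-1/2} (A + I) D^{-1/2} H W), with self-loops added.

    A: (n, n) adjacency matrix, H: (n, d_in) node features, W: (d_in, d_out) weights.
    """
    A_tilde = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))   # symmetric degree normalization
    A_norm = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)            # neighborhood aggregation + ReLU
```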

Transfer Learning and Foundation Models

  1. Radford, A., Kim, J. W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models from Natural Language Supervision." Proceedings of ICML 2021. Introduces CLIP — contrastive learning between images and text. Demonstrates that natural language supervision enables zero-shot transfer to new visual tasks. A foundation for multimodal AI. (Intermediate)

  2. Hu, E. J. et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." Proceedings of ICLR 2022. Introduces LoRA — parameter-efficient fine-tuning by injecting low-rank updates into transformer layers. Reduces fine-tuning memory by 10x while maintaining quality. The practical default for adapting foundation models. (Intermediate)
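
For paper 2 (LoRA), the core update in one line: the pretrained weight $W_0$ is frozen and only the low-rank factors are trained,

$$h = W_0 x + \frac{\alpha}{r}\,B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k),$$

with $B$ initialized to zero so that training starts from the pretrained behavior.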


Part III: Causal Inference

Foundational Frameworks

  1. Rubin, D. B. (1974). "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies." Journal of Educational Psychology, 66(5), 688-701. The paper that formalized the potential outcomes framework. Defines $Y(0), Y(1)$, the fundamental problem of causal inference, and the assumptions required for causal identification from observational data. Start here for causal inference. The core notation is sketched after this list. (Introductory)

  2. Holland, P. W. (1986). "Statistics and Causal Inference." Journal of the American Statistical Association, 81(396), 945-960. The "no causation without manipulation" paper. A clear exposition of Rubin's framework with the famous articulation of the fundamental problem. Accessible and thought-provoking. (Introductory)

  3. Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press. The book, not a paper. Introduces structural causal models, the do-calculus, and the graphical approach to causal inference. Chapters 1-4 are the essential reading for the graphical framework in Chapter 17. (Intermediate to Advanced)

  4. Pearl, J. (1995). "Causal Diagrams for Empirical Research." Biometrika, 82(4), 669-688. Introduces the backdoor criterion and front-door criterion for causal identification using DAGs. The paper that brought graphical models to causal inference. More accessible than the full Causality book. (Intermediate)

  5. Imbens, G. W. & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press. The definitive textbook on the potential outcomes approach. Chapters on matching, propensity scores, and instrumental variables are the gold standard treatments. Read selectively as a reference alongside Chapters 16-18. (Intermediate)

Estimation Methods

  1. Rosenbaum, P. R. & Rubin, D. B. (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika, 70(1), 41-55. Introduces the propensity score and proves the balancing property. The foundation for matching and IPW methods in Chapter 18. (Intermediate)

  2. Angrist, J. D. & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press. Not a paper but the most readable introduction to IV, DiD, and RDD from the econometrics perspective. Chapters 4-6 are essential reading for Chapter 18. Witty and opinionated. (Introductory)

  3. Angrist, J. D. & Imbens, G. W. (1994). "Identification and Estimation of Local Average Treatment Effects." Econometrica, 62(2), 467-475. Defines the LATE — the effect of treatment on "compliers" — and clarifies what instrumental variables actually estimate. Essential for understanding the limitations of IV methods. (Intermediate)

  4. Abadie, A., Diamond, A., & Hainmueller, J. (2010). "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." Journal of the American Statistical Association, 105(490), 493-505. Introduces the synthetic control method — constructing a weighted combination of comparison units to serve as a counterfactual. Elegant and practical for policy evaluation. (Intermediate)

  5. Callaway, B. & Sant'Anna, P. H. C. (2021). "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, 225(2), 200-230. Addresses the pitfalls of two-way fixed effects DiD with staggered treatment adoption. A critical correction to how DiD was practiced for decades. Read if your DiD has multiple treatment periods. (Advanced)

Causal Machine Learning

  1. Athey, S. & Imbens, G. W. (2016). "Recursive Partitioning for Heterogeneous Causal Effects." Proceedings of the National Academy of Sciences, 113(27), 7353-7360. Introduces causal trees — decision trees modified to estimate heterogeneous treatment effects. The predecessor to causal forests. Start here for causal ML. (Intermediate)

  2. Wager, S. & Athey, S. (2018). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests." Journal of the American Statistical Association, 113(523), 1228-1242. Introduces causal forests with honest estimation (sample-splitting) and asymptotic confidence intervals for CATE estimates. The primary method for HTE estimation in Chapter 19. (Advanced)

  3. Chernozhukov, V. et al. (2018). "Double/Debiased Machine Learning for Treatment and Structural Parameters." The Econometrics Journal, 21(1), C1-C68. Introduces Double ML — using ML for nuisance parameter estimation while maintaining valid causal inference through Neyman orthogonality and cross-fitting. The theoretical foundation for Chapter 19's DML treatment. A cross-fitting sketch follows this list. (Advanced)

  4. Kunzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). "Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning." Proceedings of the National Academy of Sciences, 116(10), 4156-4165. Introduces the X-learner and provides a unified comparison of S-learner, T-learner, and X-learner meta-learners for CATE estimation. Clear and practical. (Intermediate)

  5. Sharma, A., Kiciman, E., et al. (2020). "DoWhy: An End-to-End Library for Causal Inference." arXiv:2011.04216. Introduces the DoWhy library and its four-step workflow: model (causal graph) → identify (estimand) → estimate → refute. The practical causal inference workflow used throughout Part III. (Introductory)
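
To accompany paper 3 (Chernozhukov et al.), a minimal cross-fitting sketch for the partially linear model $Y = \theta T + g(X) + \varepsilon$; the random-forest nuisance learners and fold count are illustrative choices, and the sketch returns only the point estimate:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def dml_partialling_out(X, T, Y, n_folds=5):
    """Double ML point estimate of theta in Y = theta*T + g(X) + noise.

    Nuisance functions E[Y|X] and E[T|X] are estimated with out-of-fold
    (cross-fitted) predictions; theta is the residual-on-residual slope,
    which corresponds to a Neyman-orthogonal moment condition.
    """
    y_hat = cross_val_predict(RandomForestRegressor(n_estimators=200), X, Y, cv=n_folds)
    t_hat = cross_val_predict(RandomForestRegressor(n_estimators=200), X, T, cv=n_folds)
    y_res, t_res = Y - y_hat, T - t_hat
    return float(np.sum(t_res * y_res) / np.sum(t_res ** 2))
```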


Part IV: Bayesian and Temporal Data Science

Bayesian Methods

  1. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, 3rd ed. Chapman & Hall/CRC. The definitive Bayesian textbook. Chapters 1-5 (fundamentals), 10-12 (MCMC), and 15 (hierarchical models) are the essential references for Chapters 20-21. Available free online from the authors. (Intermediate)

  2. Hoffman, M. D. & Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo." Journal of Machine Learning Research, 15, 1593-1623. Introduces NUTS — the MCMC algorithm used by Stan and PyMC. Removes the need to hand-tune the leapfrog step count in HMC. A technical paper, but the introduction explains why HMC is better than Metropolis-Hastings. (Advanced)

  3. Vehtari, A., Gelman, A., & Gabry, J. (2017). "Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC." Statistics and Computing, 27(5), 1413-1432. Introduces PSIS-LOO — efficient leave-one-out cross-validation for Bayesian models via Pareto smoothed importance sampling. The standard model comparison tool in ArviZ. (Intermediate)

  4. Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., & Gelman, A. (2019). "Visualization in Bayesian Workflow." Journal of the Royal Statistical Society: Series A, 182(2), 389-402. Defines the Bayesian workflow: prior predictive check → fit → MCMC diagnostics → posterior predictive check → model comparison. The workflow implemented in Chapter 21. (Introductory)

Bayesian Optimization and Bandits

  1. Snoek, J., Larochelle, H., & Adams, R. P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms." Proceedings of NeurIPS 2012. Applies Bayesian optimization with Gaussian processes to hyperparameter tuning. The paper that brought BO to the ML community. Start here for Chapter 22. (Intermediate)

  2. Thompson, W. R. (1933). "On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples." Biometrika, 25(3-4), 285-294. The original Thompson sampling paper — published 90 years ago and still the algorithm of choice for many bandit problems. Short and elegant. A minimal Bernoulli-bandit sketch follows this list. (Introductory)

  3. Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). "A Tutorial on Thompson Sampling." Foundations and Trends in Machine Learning, 11(1), 1-96. A comprehensive tutorial covering the theory, analysis, and applications of Thompson sampling. Connects to exploration in recommendation systems (Chapter 22). (Intermediate)

  4. Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). "A Contextual-Bandit Approach to Personalized News Article Recommendation." Proceedings of WWW 2010. Applies contextual bandits (LinUCB) to news recommendation — one of the first papers to bridge bandits and recommendation systems. Directly relevant to the StreamRec exploration strategy. (Intermediate)
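
To accompany paper 2 (Thompson), a minimal Bernoulli-bandit sketch with conjugate Beta(1, 1) priors; the arm probabilities and horizon below are illustrative:

```python
import numpy as np

def thompson_sampling(true_probs, n_rounds=10_000, seed=0):
    """Thompson sampling for a Bernoulli bandit: sample each arm's posterior,
    play the arm with the highest draw, then update that arm's Beta posterior."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    alpha, beta = np.ones(k), np.ones(k)      # Beta posterior parameters per arm
    total_reward = 0
    for _ in range(n_rounds):
        draws = rng.beta(alpha, beta)         # one posterior sample per arm
        arm = int(np.argmax(draws))
        reward = int(rng.random() < true_probs[arm])
        alpha[arm] += reward                  # conjugate posterior update
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward, alpha, beta

# e.g. thompson_sampling([0.05, 0.04, 0.07]) concentrates plays on the third arm
```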

Time Series

  1. Lim, B., Arik, S. O., Loeff, N., & Pfister, T. (2021). "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting." International Journal of Forecasting, 37(4), 1748-1764. Introduces the Temporal Fusion Transformer (TFT) — a transformer architecture designed for time series with interpretable attention mechanisms and variable selection. The primary DL time series model in Chapter 23. (Intermediate)

  2. Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2020). "N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting." Proceedings of ICLR 2020. Introduces N-BEATS — a pure DL approach to time series that requires no hand-crafted features. Demonstrates that deep learning can match or exceed statistical methods on standard forecasting benchmarks. (Intermediate)

  3. Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). "DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks." International Journal of Forecasting, 36(3), 1181-1191. Introduces DeepAR — autoregressive probabilistic forecasting that learns across multiple related time series. The key idea: sharing parameters across series enables learning from limited data per series. (Intermediate)


Part V: Production ML Systems

System Design and Architecture

  1. Sculley, D. et al. (2015). "Hidden Technical Debt in Machine Learning Systems." Proceedings of NeurIPS 2015. The "ML systems" paper. Argues that the ML model is a small fraction of a production ML system; the rest is data pipelines, monitoring, configuration, and infrastructure. The motivation for all of Part V. Start here for production ML. (Introductory)

  2. Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2017). "Data Management Challenges in Production Machine Learning." Proceedings of SIGMOD 2017. Catalogs data management challenges in production ML: data validation, feature management, training-serving skew, and data lifecycle. From the Google TFX team. (Introductory)

  3. Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction." Proceedings of IEEE Big Data 2017. Proposes a checklist for production ML readiness: tests for features, data, model, infrastructure, and monitoring. The testing strategy framework used in Chapter 28. (Introductory)

  4. Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2022). "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys, 55(6), 1-29. A survey of real-world deployment challenges across domains. Catalogs common failure modes: data quality, feedback loops, monitoring gaps, and organizational issues. Excellent for understanding the gap between research and production. (Introductory)

ML Pipelines and Infrastructure

  1. Baylor, D. et al. (2017). "TFX: A TensorFlow-Based Production-Scale Machine Learning Platform." Proceedings of KDD 2017. Describes Google's TFX platform: data validation, feature transformation, training, evaluation, and serving. The reference architecture for ML pipelines (Chapter 27). (Intermediate)

  2. Zaharia, M. et al. (2018). "Accelerating the Machine Learning Lifecycle with MLflow." IEEE Data Engineering Bulletin, 41(4), 39-45. Introduces MLflow: experiment tracking, model registry, and deployment. The most widely adopted open-source MLOps tool. (Introductory)

  3. Hermann, J. & Del Balso, M. (2017). "Meet Michelangelo: Uber's Machine Learning Platform." Uber Engineering Blog. Describes Uber's ML platform with emphasis on feature stores, model management, and online/offline prediction serving. One of the first public descriptions of a feature store architecture. (Introductory)

Experimentation

  1. Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. The book on A/B testing at scale, from the team that built Microsoft's experimentation platform. Chapters on interference, variance reduction, and common pitfalls are essential references for Chapter 33. (Introductory)

  2. Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data." Proceedings of WSDM 2013. Introduces CUPED — variance reduction using pre-experiment covariates. One of the highest-impact practical techniques for experimentation at scale. The adjustment is sketched after this list. (Intermediate)

  3. Johari, R., Koomen, P., Pekelis, L., & Walsh, D. (2017). "Peeking at A/B Tests: Why It Matters, and What to Do About It." Proceedings of KDD 2017. Formalizes the "peeking problem" (checking experiment results repeatedly inflates false positive rates) and proposes always-valid p-values as a solution. Essential for sequential testing in Chapter 33. (Intermediate)
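
To accompany paper 2 (Deng et al.), a minimal sketch of the CUPED adjustment; here x is a pre-experiment covariate, for example the same metric measured before the experiment started:

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: subtract the component of the metric y explained by a
    pre-experiment covariate x, reducing variance without changing the mean."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Treatment effects are then estimated on the adjusted metric, with the same
# expected point estimate but (often much) tighter confidence intervals.
```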


Part VI: Responsible and Rigorous Data Science

Fairness

  1. Chouldechova, A. (2017). "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Big Data, 5(2), 153-163. Proves the impossibility theorem: calibration and equal false positive/negative rates across groups cannot simultaneously hold when base rates differ. The mathematical foundation of Chapter 31's fairness treatment. Start here for algorithmic fairness. The identity behind the result is sketched after this list. (Intermediate)

  2. Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). "Inherent Trade-Offs in the Fair Determination of Risk Scores." Proceedings of ITCS 2017. Independent proof of the impossibility result, from the computer science perspective. Together with Chouldechova (2017), establishes that fairness metric selection is an ethical choice, not a technical optimization. (Intermediate)

  3. Hardt, M., Price, E., & Srebro, N. (2016). "Equality of Opportunity in Supervised Learning." Proceedings of NeurIPS 2016. Introduces the equalized odds and equal opportunity fairness criteria. Proposes post-processing threshold adjustment — the simplest fairness intervention. (Intermediate)

  4. Buolamwini, J. & Gebru, T. (2018). "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." Proceedings of FAT 2018*. Demonstrates that commercial facial recognition systems have dramatically higher error rates for darker-skinned women. A landmark paper that shaped the fairness discourse and regulatory landscape. (Introductory)

  5. Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and Machine Learning: Limitations and Opportunities. MIT Press. The definitive textbook on algorithmic fairness. Comprehensive treatment of fairness definitions, impossibility results, and organizational practice. Available free online. (Introductory to Intermediate)
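
For paper 1 (Chouldechova), the identity behind the impossibility result, where $p$ is the group's base rate:

$$\mathrm{FPR} = \frac{p}{1 - p}\cdot\frac{1 - \mathrm{PPV}}{\mathrm{PPV}}\cdot\left(1 - \mathrm{FNR}\right)$$

When base rates differ across groups and the classifier is imperfect, predictive parity (equal PPV) and equal error rates (FPR and FNR) cannot all hold at once.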

Privacy

  1. Dwork, C. & Roth, A. (2014). "The Algorithmic Foundations of Differential Privacy." Foundations and Trends in Theoretical Computer Science, 9(3-4), 211-407. The canonical reference for differential privacy theory. Derives the Laplace and Gaussian mechanisms, composition theorems, and connections to statistical inference. Chapters 1-3 are essential for Chapter 32. Start here for differential privacy. (Intermediate)

  2. Abadi, M. et al. (2016). "Deep Learning with Differential Privacy." Proceedings of CCS 2016. Introduces DP-SGD — training neural networks with differential privacy guarantees via per-sample gradient clipping and noise addition. The foundation for Opacus and Chapter 32's DP training. A single-step sketch follows this list. (Intermediate)

  3. McMahan, B. et al. (2017). "Communication-Efficient Learning of Deep Networks from Decentralized Data." Proceedings of AISTATS 2017. Introduces Federated Averaging (FedAvg) — the foundational federated learning algorithm. Data stays on-device; only model updates are communicated. (Intermediate)

  4. Kairouz, P. et al. (2021). "Advances and Open Problems in Federated Learning." Foundations and Trends in Machine Learning, 14(1-2), 1-210. A comprehensive survey of federated learning covering algorithms, systems challenges (heterogeneity, communication, privacy), and applications. The reference for Chapter 32's FL treatment. (Advanced)
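
To accompany paper 2 (Abadi et al.), a minimal NumPy sketch of a single DP-SGD step; the clipping norm, noise multiplier, and learning rate are illustrative, and real implementations (e.g., Opacus) also track the cumulative privacy budget:

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.1, rng=None):
    """One DP-SGD update: clip each per-sample gradient to clip_norm,
    average, then add Gaussian noise calibrated to the clipping bound."""
    rng = rng if rng is not None else np.random.default_rng()
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_sample_grads]                 # per-sample clipping
    g_bar = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=g_bar.shape)                  # Gaussian mechanism
    return params - lr * (g_bar + noise)
```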

Uncertainty and Calibration

  1. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." Proceedings of ICML 2017. Shows that modern deep networks are poorly calibrated — they are systematically overconfident. Introduces temperature scaling as a simple and effective recalibration method. Start here for calibration. (Introductory)

  2. Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer. The book that introduced conformal prediction — distribution-free prediction sets with finite-sample coverage guarantees. Chapter 2 provides the theoretical foundation for Chapter 34. (Advanced)

  3. Angelopoulos, A. N. & Bates, S. (2023). "Conformal Prediction: A Gentle Introduction." Foundations and Trends in Machine Learning, 16(4), 494-591. The most accessible introduction to conformal prediction. Covers split conformal, adaptive conformal inference, and connections to other uncertainty quantification methods. (Introductory)

  4. Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles." Proceedings of NeurIPS 2017. Demonstrates that ensembles of neural networks provide well-calibrated uncertainty estimates that often outperform Bayesian approaches. The practical baseline for uncertainty quantification. (Intermediate)

  5. Gal, Y. & Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." Proceedings of ICML 2016. Shows that dropout at test time (MC Dropout) approximates Bayesian inference, providing uncertainty estimates from any dropout-trained network. Elegant and practical. (Intermediate)

Interpretability and Explainability

  1. Lundberg, S. M. & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." Proceedings of NeurIPS 2017. Introduces SHAP — connecting Shapley values from cooperative game theory to feature attribution for ML models. Unifies LIME, DeepLIFT, and other methods under a single framework. Start here for explainability. (Intermediate)

  2. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." Proceedings of KDD 2016. Introduces LIME — local interpretable model-agnostic explanations via perturbation-based local linear approximations. (Introductory)

  3. Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList." Proceedings of ACL 2020. Introduces behavioral testing for ML models: invariance tests, directional tests, and minimum functionality tests. The testing framework used in Chapter 28. (Introductory)

  4. Kim, B. et al. (2018). "Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)." Proceedings of ICML 2018. Introduces concept-based explanations — explaining model behavior in terms of human-understandable concepts ("stripedness," "texture") rather than individual features. (Intermediate)
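
For paper 1 (SHAP), the Shapley attribution that the framework approximates, where $F$ is the feature set and $f_S$ is the model's expected output when only the features in $S$ are known:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!}\left[f_{S \cup \{i\}}(x) - f_S(x)\right]$$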


Part VII: Leadership and Synthesis

Research Methodology and Practice

  1. Lipton, Z. C. & Steinhardt, J. (2019). "Troubling Trends in Machine Learning Scholarship." Queue, 17(1), 45-77. A critical examination of common problems in ML papers: failure to identify the source of gains, mathiness, and misuse of language. Essential reading alongside Chapter 37's paper-reading methodology. (Introductory)

  2. Pineau, J. et al. (2021). "Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)." Journal of Machine Learning Research, 22(164), 1-20. Reports on the state of ML reproducibility and proposes concrete standards: reproducibility checklists, code submission, and standardized reporting. (Introductory)

  3. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). "Deep Reinforcement Learning That Matters." Proceedings of AAAI 2018. Demonstrates that many reported gains in deep RL are within the variance of random seeds and hyperparameter choices. A cautionary tale about evaluation methodology. (Intermediate)

Recommendation Systems (cross-cutting)

  1. Koren, Y., Bell, R., & Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." Computer, 42(8), 30-37. The Netflix Prize paper — matrix factorization for collaborative filtering. The mathematical foundation of Chapter 1's StreamRec milestone. Start here for recommendation systems. The factorization objective appears after this list. (Introductory)

  2. Covington, P., Adams, J., & Sargin, E. (2016). "Deep Neural Networks for YouTube Recommendations." Proceedings of RecSys 2016. Describes YouTube's two-stage recommendation architecture: candidate generation + ranking. The architecture pattern used throughout the StreamRec project. (Introductory)

  3. Yi, X. et al. (2019). "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations." Proceedings of RecSys 2019. Addresses the sampling bias in two-tower models trained on implicit feedback (users only interact with items that were shown to them). Proposes in-batch negative sampling correction. (Intermediate)

  4. Chen, M., Beutel, A., Covington, P., Jain, S., Belletti, F., & Chi, E. (2019). "Top-K Off-Policy Correction for a REINFORCE Recommender System." Proceedings of WSDM 2019. Applies off-policy reinforcement learning to recommendations at YouTube scale. Directly relevant to the causal evaluation of recommendations in Chapter 19. (Advanced)
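
For paper 1 (Koren et al.), the regularized factorization objective at the heart of the method, with user and item factors $p_u, q_i$ and the set of observed ratings $\mathcal{K}$:

$$\min_{P,\,Q} \sum_{(u,i) \in \mathcal{K}} \left(r_{ui} - p_u^{\top} q_i\right)^2 + \lambda\left(\|p_u\|^2 + \|q_i\|^2\right)$$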

Scientific ML (Climate)

  1. Reichstein, M. et al. (2019). "Deep Learning and Process Understanding for Data-Driven Earth System Science." Nature, 566, 195-204. A manifesto for combining deep learning with physical process models in earth science. The conceptual foundation for the Pacific Climate anchor example. (Introductory)

  2. Ravuri, S. et al. (2021). "Skilful Precipitation Nowcasting Using Deep Generative Models of Radar." Nature, 597, 672-677. Applies generative models (conditional GANs) to weather nowcasting, outperforming physics-based models at short lead times. A landmark application of deep learning in climate science. (Intermediate)

ML in Healthcare (Pharma)

  1. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science, 366(6464), 447-453. Reveals that a widely used health algorithm exhibited racial bias because it used health care costs as a proxy for health needs, and costs are confounded by differential access to care. Essential reading for causal thinking in healthcare. (Introductory)

  2. Hernán, M. A. & Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC. The epidemiological perspective on causal inference. Chapters on time-varying treatments, inverse probability weighting, and instrumental variables are excellent complements to Chapter 18. Available free online. (Intermediate)

ML in Finance (Credit Scoring)

  1. Federal Reserve (2011). "Supervisory Guidance on Model Risk Management (SR 11-7)." The U.S. regulatory framework for model risk management. Defines model validation requirements, documentation standards, and governance expectations. Required reading for anyone building ML in financial services. Not a paper, but essential reference for Chapter 35's regulatory treatment. (Introductory)

  2. Bartlett, R., Morse, A., Stanton, R., & Wallace, N. (2022). "Consumer-Lending Discrimination in the FinTech Era." Journal of Financial Economics, 143(1), 30-56. Documents pricing discrimination in algorithmic lending. Uses causal methods to identify disparate impact in mortgage pricing. Relevant to the Meridian Financial case. (Intermediate)


Cross-Cutting: Surveys and Textbooks

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. The "bible" of deep learning. Part I (applied math), Part II (deep networks), Part III (research topics) cover the mathematical foundations through advanced architectures. Available free online. Chapters on optimization (8), regularization (7), and CNNs (9) complement Part II of this book. (Intermediate)

  2. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Modern ML through a probabilistic lens. Excellent treatment of Bayesian methods, generative models, and decision theory. Available free online. (Intermediate)

  3. Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. The sequel covering deep generative models, Bayesian deep learning, causal inference, and more. An excellent reference for topics spanning Parts II-IV. Available free online. (Advanced)

  4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. The classic reference for statistical learning. Chapters on boosting, random forests, and regularization remain essential. Available free online. (Intermediate)

  5. Bishop, C. M. & Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer. A modern deep learning textbook with careful mathematical treatment. Excellent chapters on transformers, diffusion models, and normalizing flows. Complements Part II. (Intermediate)


Suggested Reading Orders

Paper numbers below count the papers consecutively across the whole appendix, from Part I through the Cross-Cutting list, rather than restarting within each subsection.

For the practitioner who wants breadth (20 papers):

Papers 3, 10, 14, 20, 29, 30, 35, 40, 44, 48, 56, 58, 63, 66, 71, 75, 77, 80, 87, 88

For the deep learning specialist (15 papers):

Papers 10, 14, 16, 17, 19, 20, 22, 24, 27, 28, 29, 53, 79, 83, 101

For the causal inference specialist (15 papers):

Papers 30, 33, 35, 37, 38, 39, 40, 41, 42, 43, 44, 66, 93, 94, 96

For the production ML engineer (15 papers):

Papers 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 71, 72, 75, 82, 88

For the fairness/privacy researcher (12 papers):

Papers 66, 67, 68, 69, 70, 71, 72, 73, 74, 80, 93, 96


This reading list reflects the literature as of early 2025. The field moves fast. Use Chapter 37's reading strategies to stay current: follow key authors, track major conference proceedings (NeurIPS, ICML, ICLR, KDD, ACL, CVPR), and maintain a personal annotated bibliography. The papers here are landmarks — they will remain relevant long after the field moves on.