Chapter 37: Further Reading
Essential Sources
1. Srinivasan Keshav, "How to Read a Paper" (ACM SIGCOMM Computer Communication Review, 2007)
The paper that introduced the three-pass reading strategy used throughout this chapter. In three pages, Keshav describes a method for efficiently reading research papers that has been adopted by thousands of researchers and practitioners across computer science, data science, and related fields. The paper's clarity and brevity are themselves a model of effective technical writing.
Reading guidance: The paper is short enough to read in a single sitting (approximately 10 minutes). Section 2 describes the three passes in detail, with specific guidance on what to read and what to skip at each stage. Section 3 extends the method to conducting a literature survey — how to use the three-pass approach to efficiently map an unfamiliar research area. The key insight is that most papers should be read only once (the first pass), a few should be read twice (the second pass), and very few should be read three times. The method works because it front-loads the filtering decision: you invest five minutes to determine whether a paper merits fifty minutes, not fifty minutes to determine whether it merits five hours. For practitioners, the first-pass checklist — title, abstract, headings, figures, conclusion, references — is the single highest-value technique. Keshav's paper does not address the evaluation of experimental methodology in depth; for that, supplement with the Lipton and Steinhardt reference below. The paper was written for computer science broadly but applies without modification to machine learning, statistics, and data science.
2. Zachary C. Lipton and Jacob Steinhardt, "Troubling Trends in Machine Learning Scholarship" (ICML Workshop on Debates in Machine Learning, 2018)
A systematic analysis of methodological and rhetorical problems in machine learning research. Lipton and Steinhardt identify four troubling trends: failure to distinguish between explanation and speculation, failure to identify the source of empirical gains (inadequate ablations), mathiness (the use of mathematical formalism to obscure rather than clarify), and misuse of language (terms like "understanding" and "reasoning" applied to systems that do neither). The paper is both a diagnosis and a call to action.
Reading guidance: Section 2 (Failure to Distinguish Explanation and Speculation) directly supports this chapter's discussion of HARKing and post-hoc narrative construction. Lipton and Steinhardt provide concrete examples of papers that present post-hoc observations as pre-registered hypotheses — a practice that inflates apparent predictive success. Section 3 (Failure to Identify the Source of Empirical Gains) is the most practically useful for paper evaluation: it catalogs specific patterns by which papers overstate their contributions, including comparing against untuned baselines, omitting ablation studies, and reporting improvements on selectively chosen metrics. The paper's examples are drawn from NLP and deep learning, but the patterns it identifies are universal across ML subfields. Section 4 (Mathiness) is relevant for readers of theoretical papers — it describes the practice of including mathematical formalism that does not contribute to understanding, either because the formalization is trivial or because it obscures rather than clarifies the contribution. For recommendation systems practitioners, the baseline evaluation guidance in Section 3 is directly applicable to evaluating papers at RecSys, KDD, and WSDM. For causal inference practitioners, Section 2's discussion of explanation vs. speculation maps to the distinction between pre-registered causal hypotheses and post-hoc causal narratives. This paper should be read alongside Keshav — together, they provide the structural framework (Keshav) and the critical evaluation skills (Lipton and Steinhardt) for effective research reading.
3. Edward Raff, "A Step Toward Quantifying Independently Reproducible Machine Learning Research" (NeurIPS, 2019)
The first large-scale quantitative study of reproducibility in machine learning. Raff attempted to reproduce 255 papers across ML subfields, tracking which factors predicted successful reproduction. The overall reproduction rate was 63.5% — meaning that more than one-third of papers could not be independently reproduced.
Reading guidance: Table 1 presents the main results: reproduction success rates by venue, year, and subfield. The key finding is that code availability is the single strongest predictor of reproducibility — papers with released code are roughly twice as likely to be reproducible as papers without. Section 4 analyzes additional factors: clarity of the experimental description, mathematical notation consistency, and whether the paper acknowledges limitations all correlate positively with reproducibility. Section 5 discusses the methodology of reproduction attempts, including the critical decision of how much effort to invest before declaring a reproduction failure (Raff used a threshold of approximately 40 person-hours per paper). The paper's limitations are important: reproduction was attempted by a single researcher, the definition of "successful reproduction" was necessarily subjective (results within a tolerance of the original), and the sample is not random (papers were selected based on relevance to Raff's research interests). Nevertheless, the study provides the strongest quantitative evidence available for the scale of the reproducibility problem in ML. For practitioners, the implication is straightforward: papers without released code should be treated with additional skepticism, and the expected base rate for reproduction is approximately 60-65%, not 100%. When planning to implement a paper, budget time for reproduction failure and have a contingency plan.
4. Jessica Hullman et al., "The Worst of Both Worlds: A Comparative Analysis of Errors in Learning from Data in Psychology and Machine Learning" (AIES, 2022)
A comparative analysis of methodological errors across psychology (which has faced its own replication crisis since 2011) and machine learning. Hullman identifies structural parallels: both fields suffer from flexible analysis pipelines (researcher degrees of freedom), publication bias toward positive results, and insufficient reporting of uncertainty. The paper argues that ML's emphasis on benchmark performance creates incentive structures similar to those that produced psychology's replication crisis.
Reading guidance: Section 2 maps specific error types between the two fields. P-hacking in psychology (trying multiple statistical tests until finding significance) maps to hyperparameter tuning on the test set in ML. HARKing in psychology (formulating hypotheses after seeing results) maps to post-hoc architecture justification in ML. The "garden of forking paths" (Gelman and Loken, 2014) — the exponentially many analysis choices that could have been made differently — applies to both fields. Section 3 proposes structural interventions, including pre-registration of ML experiments (specifying the evaluation protocol, primary metric, and analysis plan before running experiments), hold-out benchmarks administered by third parties, and separation of the hyperparameter tuning and evaluation datasets. The paper is more analytical than prescriptive — it diagnoses the problem more thoroughly than it solves it — but the diagnosis is valuable for calibrating how much trust to place in published ML results. For readers of this textbook who completed Chapter 33 (Rigorous Experimentation), Hullman's analysis provides the meta-scientific context for why the experimental rigor developed in that chapter is necessary.
5. Yoav Goldberg, "A Primer on Neural Network Models for Natural Language Processing" (JAIR, 2016) — as an example of a well-structured survey paper
Goldberg's primer is included not for its content (which is now dated in parts) but as an exemplar of how to write and read a survey paper. Surveys are the most efficient way to enter a new subfield, but they vary enormously in quality. Goldberg's primer demonstrates the features of a good survey: clear scope definition, consistent notation, honest assessment of method limitations, and comprehensive but selective referencing.
Reading guidance: Rather than reading the paper for its technical content (much of which has been superseded by transformer-based methods), read it as a model of survey writing and reading. Note how Goldberg defines notation once and uses it consistently throughout. Note how each method is described with both its strengths and its limitations. Note how the bibliography is structured — foundational references are clearly distinguished from incremental contributions. When you encounter a new subfield (e.g., causal representation learning, foundation models for science, or any area not covered in this textbook), your first step should be to find a well-written survey of that subfield and apply the three-pass strategy to it. The first pass on a survey tells you the subfield's scope and major themes. The second pass gives you the conceptual framework. The references give you the primary papers for third-pass reading. For finding surveys, Google Scholar's "cited by" feature is invaluable: find the subfield's foundational paper, look at what cites it, and filter for titles containing "survey," "review," "tutorial," or "primer." Alternatively, search arXiv for "[subfield] survey" and sort by submission date, newest first.
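The arXiv search described above can also be scripted against arXiv's public query API. The sketch below only builds the query URL — the subfield string and the keyword list are illustrative choices, not part of any source above — and leaves the actual fetch and Atom-feed parsing to the reader:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def survey_query_url(subfield: str, max_results: int = 10) -> str:
    """Build an arXiv API URL that searches paper titles for the
    subfield plus survey-style keywords, newest submissions first."""
    keywords = ["survey", "review", "tutorial", "primer"]
    # Title must mention the subfield AND one of the survey keywords.
    kw_clause = " OR ".join(f'ti:"{kw}"' for kw in keywords)
    search = f'ti:"{subfield}" AND ({kw_clause})'
    params = {
        "search_query": search,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = survey_query_url("causal representation learning")
# Fetch the URL with urllib.request.urlopen(url) and parse the
# returned Atom feed to list candidate surveys for a first pass.
print(url)
```

The returned results are exactly the candidate set for Keshav's first pass: skim title, abstract, and headings of each hit before committing to a full read.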