Chapter 37: Quiz

Test your understanding of research reading strategy, methodology evaluation, common pitfalls, and the paper-to-production gap. Answers follow each question.


Question 1

What are the three passes in Keshav's reading strategy, and approximately how long should each take?

Answer

**Pass 1 (Survey, 5-10 minutes):** Read the title, abstract, introduction (first and last paragraphs), section headings, figures/tables (captions especially), conclusion, and references. The goal is to decide whether the paper deserves a second pass. **Pass 2 (Comprehension, 30-60 minutes):** Read the full paper excluding detailed proofs and derivations. Annotate figures, tables, and the experimental setup. Build a mental model accurate enough to explain to a colleague. **Pass 3 (Critical, 2-5 hours):** Recreate key derivations, attempt to reproduce key results, identify implicit assumptions, and map the method to your production context. The recommended ratio is approximately 10:3:1 — for every ten papers you first-pass, you second-pass about three, and third-pass at most one.

Question 2

What four conditions must a fair baseline satisfy?

Answer

A fair baseline must be: **(1) Tuned** — the baseline must be tuned with the same care as the proposed method, not run with default hyperparameters. **(2) Current** — the baseline must represent the current state of the art, not a historical method that has been superseded. **(3) Equivalent** — the baseline must use the same data, features, and evaluation protocol as the proposed method. **(4) Representative** — the baseline set must include diverse approaches spanning different paradigms, not just methods from the same family. A method that beats an untuned, outdated baseline has proven very little.

Question 3

What is an ablation study, and why is it more informative than a comparison against baselines?

Answer

An **ablation study** removes or modifies individual components of a proposed method to measure each component's marginal contribution to performance. The term comes from neuroscience, where brain regions are ablated to study their function. An ablation is more informative than baseline comparison because baselines tell you whether the *full method* outperforms simpler approaches, but they do not tell you *which components* of the method are responsible for the improvement. An ablation that shows Component A contributes 80% of the improvement and Components B-D contribute 20% combined has practical implications: you can capture most of the benefit by implementing only Component A, at a fraction of the complexity and engineering cost. Without an ablation, a complex method is asking you to trust that every component is necessary — a trust that is frequently misplaced.
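
The logic of an ablation is simple enough to sketch in a few lines. Everything below is hypothetical: the component names, base score, and per-component contributions are invented for illustration, and `evaluate` stands in for a real train-and-evaluate run.

```python
# Hypothetical ablation harness. In practice, evaluate() would train and
# score a model built from the given components; here it returns made-up
# numbers so the reporting logic is visible.
def evaluate(components):
    contribution = {"attention": 0.080, "residual": 0.010,
                    "layer_norm": 0.006, "dropout": 0.004}
    base = 0.700  # score of the method with every component removed
    return base + sum(contribution[c] for c in components)

full = ["attention", "residual", "layer_norm", "dropout"]
full_score = evaluate(full)
print(f"full method: {full_score:.3f}")

# Remove one component at a time and measure the marginal drop.
for c in full:
    ablated = [x for x in full if x != c]
    drop = full_score - evaluate(ablated)
    print(f"without {c}: score drops by {drop:.3f}")
```

In this toy setup the table would immediately show that one component ("attention") accounts for most of the improvement, which is exactly the kind of finding that changes an implementation decision.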

Question 4

A paper reports a single-run result of 0.847 BLEU on a translation benchmark. What is the methodological concern?

Answer

A single-run result tells you the outcome of one experiment, not a generalizable finding. Training runs in ML are stochastic — they depend on random weight initialization, data shuffling order, and other sources of randomness. Bouthillier et al. (2021) demonstrated that variance from random seeds alone can exceed the difference between methods on standard benchmarks. The minimum standard is reporting results over multiple random seeds (at least three, preferably five) with mean and standard deviation. A better standard is a paired bootstrap test on the difference between methods. Without multiple runs, a reported 0.847 could equally plausibly be the result of a lucky seed producing an outcome from a distribution whose mean is 0.830 — below the baseline it claims to beat.
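
The minimum reporting standard described above (mean and standard deviation over several seeds) takes only a few lines to compute. The scores below are invented for illustration.

```python
import statistics

# Illustrative scores from five training runs that differ only in the
# random seed (numbers are made up for the example).
scores = [0.847, 0.831, 0.825, 0.838, 0.829]

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
print(f"{mean:.3f} +/- {std:.3f} over {len(scores)} seeds")
```

Note how the single best run (0.847) sits well above the mean: reporting only that run would overstate the method by roughly one and a half standard deviations.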

Question 5

What is dataset leakage, and what are its three most common forms in machine learning?

Answer

**Dataset leakage** occurs when information from the test set influences training, artificially inflating reported performance. The three most common forms are: **(1) Temporal leakage** — training on data that is chronologically after the test data. This is most common in time-series forecasting, recommendation, and fraud detection, where the model can effectively memorize future events. **(2) Preprocessing leakage** — fitting preprocessing steps (normalization, imputation, feature selection) on the full dataset before splitting, so that test set statistics influence training data preprocessing. **(3) Group leakage** — when data points from the same entity (user, patient, session) appear in both training and test sets, allowing the model to memorize entity-level patterns rather than learning generalizable features. A hallmark symptom of leakage is suspiciously high performance (e.g., 0.99 AUC where SOTA is 0.85).
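
Preprocessing leakage (form 2) can be shown with a toy one-feature example. The numbers are invented; the point is that a statistic fitted on the full dataset is contaminated by the test split.

```python
# One numeric feature; the last value belongs to the test split and happens
# to be an outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

# Leaky: normalization statistic computed on the full dataset, so the
# test-set outlier influences how the training data is preprocessed.
leaky_mean = sum(data) / len(data)

# Correct: the statistic is fitted on the training split only and then
# applied, unchanged, to the test split.
train_mean = sum(train) / len(train)

print(f"leaky mean:   {leaky_mean}")   # 22.0 -- dragged up by the test point
print(f"correct mean: {train_mean}")   # 2.5
```

The same discipline applies to any fitted preprocessing step (imputation values, feature-selection masks, encoder vocabularies): fit on the training split, apply everywhere.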

Question 6

Explain the difference between p-hacking and HARKing.

Answer

**P-hacking** (also called data dredging or specification searching) is the practice of trying multiple statistical tests, feature combinations, model configurations, or evaluation settings until a "significant" result emerges. With enough tests, a spurious significant result is guaranteed by chance. In ML, p-hacking manifests as trying many architectures and hyperparameter combinations and reporting only the configuration that achieved the best test performance, without accounting for the multiple comparisons. **HARKing** (Hypothesizing After the Results are Known) is the practice of formulating hypotheses after seeing the data and presenting them as if they were pre-registered predictions. A paper that claims "we hypothesized that attention heads 3 and 7 would specialize for syntax" is almost certainly describing a post-hoc observation as a prediction. Both practices inflate the apparent reliability of findings, but through different mechanisms: p-hacking inflates significance by exploiting multiple testing, while HARKing inflates narrative coherence by reconstructing the story to match the results.
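
The "guaranteed by chance" claim can be checked with a short simulation, under the standard assumption that a p-value is uniform on [0, 1] when the null hypothesis is true. All numbers here are illustrative.

```python
import random

random.seed(0)

def any_false_positive(n_tests, alpha=0.05):
    """Run n_tests null tests; return True if any appears 'significant'."""
    # Under the null, each p-value is uniform on [0, 1].
    return any(random.random() < alpha for _ in range(n_tests))

trials = 10_000
hits = sum(any_false_positive(20) for _ in range(trials))
rate = hits / trials
print(f"P(at least one 'significant' result in 20 null tests) ~ {rate:.2f}")
# Analytically: 1 - 0.95**20, about 0.64
```

In other words, an analyst who quietly tries twenty specifications has roughly a two-in-three chance of finding a "significant" result even when no real effect exists.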

Question 7

What is publication bias, and how should a practitioner account for it when reading the literature?

Answer

**Publication bias** is the systematic tendency for positive results (method beats baseline) to be published while negative results (method does not beat baseline) remain unpublished. This means the published literature is a biased sample of all experiments conducted — the literature systematically overstates the effectiveness of proposed methods. For practitioners, the implication is that when you read five papers proposing a particular approach, all reporting improvements, you are seeing only the successes. The groups that tried similar approaches and failed did not publish. A practical mitigation is to apply a mental "publication bias tax": treat small improvements (1-2%) on standard benchmarks as noise until independently reproduced, and discount reported improvements proportionally to the number of similar papers in the area (more papers means more selection pressure, meaning more inflation).

Question 8

What is the difference between a peer-reviewed paper and a preprint, and how should this affect your reading strategy?

Answer

A **peer-reviewed paper** has been evaluated by (usually) three expert reviewers and an area chair at a conference or journal. The review process provides a minimum quality floor — someone has checked the methodology, questioned the claims, and evaluated the contribution's significance. A **preprint** (typically posted on arXiv) has not been peer-reviewed; it is posted directly by the authors with no quality filter. For reading strategy: peer-reviewed papers at top venues warrant a lower skepticism level — focus evaluation on methodology and production relevance rather than basic soundness. Preprints from established labs warrant evaluation similar to peer-reviewed papers, since the group's track record serves as a partial quality signal. Preprints from unknown groups with extraordinary claims warrant maximum skepticism — the prior probability that a breakthrough was posted to arXiv without peer review is low. This is Bayesian reasoning, not gatekeeping.

Question 9

What is the "leaderboard trap" in machine learning benchmarks?

Answer

The **leaderboard trap** occurs when a benchmark (GLUE, ImageNet, MovieLens, etc.) becomes the primary optimization target rather than a proxy for the underlying task. When methods are optimized for benchmark performance, researchers adopt tricks that improve benchmark scores without generalizing to production data: ensemble averaging, test-time augmentation, task-specific architecture choices, and aggressive hyperparameter tuning on the test set's specific distribution. The result is that leaderboard position becomes an unreliable proxy for practical utility. A method at position 1 on a leaderboard may be specifically engineered for that benchmark's quirks and perform no better (or worse) than a method at position 10 on real-world data. The correct question is not "where does this method rank?" but "does this benchmark reflect my production conditions?"

Question 10

Name the five gaps in the paper-to-production translation framework, and give one example of each.

Answer

**(1) Data gap** — papers use clean, curated datasets; production data has missing values, label noise, and distribution shift. Example: a model trained on curated movie ratings underperforms on production data where 15% of ratings are missing. **(2) Scale gap** — papers evaluate at a single scale; production systems face latency, throughput, and cost constraints. Example: a model that takes 200ms per inference in the paper must serve at 50ms p99 in production. **(3) Evaluation gap** — papers measure offline metrics; production success requires online A/B test validation plus fairness and causal metrics. Example: a model with better offline NDCG@10 fails to improve online engagement minutes. **(4) Infrastructure gap** — papers assume infrastructure exists; production requires building feature stores, training pipelines, and deployment systems. Example: a method requiring real-time feature updates needs a streaming feature store that does not exist. **(5) Maintenance gap** — papers report one-time results; production systems must be maintained, retrained, monitored, and debugged continuously. Example: a complex method requires PhD-level expertise to debug, making it impractical for a team without that expertise.

Question 11

A paper reports that their method achieves Hit@10 of 0.167 on MovieLens-1M. Your current StreamRec production system achieves Hit@10 of 0.155. Is this improvement (0.012 absolute, roughly 8% relative) worth implementing?

Answer

This question cannot be answered from the numbers alone — it requires analysis across multiple dimensions. **(1) Is the improvement real?** Without confidence intervals, the 0.012 difference may be within the noise range. Multiple-seed evaluation and a paired bootstrap test are needed. **(2) Does it transfer?** MovieLens-1M has ~1M ratings from 6K users on 4K movies. StreamRec has 50M MAU and 200K items. The data distributions, scale, and evaluation conditions are fundamentally different. Expect 20-40% degradation when transferring to production data. **(3) Is it practically significant?** A 0.012 absolute improvement on Hit@10 may or may not translate to meaningful improvement on business metrics (engagement, retention, revenue). Only an A/B test can determine this. **(4) What does it cost?** If implementation requires 8 weeks of engineering and the expected production lift after degradation is 0.005-0.008 Hit@10, the effort-adjusted value must be compared against other investments the team could make. The correct answer is "prototype first" — reproduce the result on StreamRec data, measure the actual improvement, and proceed to A/B testing only if the improvement exceeds the minimum detectable effect.

Question 12

Why is the "Alternatives Considered" section of a paper's related work important for critical reading?

Answer

The "Alternatives Considered" or related work section reveals whether the authors are aware of and engaged with the competitive landscape. A paper that cites and compares against the major competing approaches demonstrates that the proposed method was chosen deliberately — the result of analyzing multiple options against explicit criteria. A paper that ignores major competing methods raises two concerns: (1) the experimental comparison may be incomplete (the method may not beat the alternatives it ignores), and (2) the authors may lack awareness of the field's state of the art, which questions the claim of novelty. The related work section also provides a citation network for further reading — it tells you which papers the authors consider most important, which can guide your own reading.

Question 13

What is the paired bootstrap test, and why is it preferred over independent confidence intervals for comparing two ML methods?

Answer

The **paired bootstrap test** computes a confidence interval on the *difference* in performance between two methods evaluated on the same test instances. It works by resampling the per-instance score differences (score_A - score_B) with replacement many times (e.g., 10,000) and computing the mean of each resample. The 2.5th and 97.5th percentiles of the bootstrap distribution form a 95% confidence interval. It is preferred over independent confidence intervals because the two methods are evaluated on *the same data*, which introduces correlation. Two methods that both struggle on the same difficult examples will have correlated scores. Independent confidence intervals ignore this correlation and may indicate uncertainty where none exists (or vice versa). The paired test directly measures the variability of the *difference*, which is the quantity of interest. If the 95% CI on the difference excludes zero, the difference is statistically significant at the 5% level.
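
The procedure described above fits in a short function using only the standard library. The per-instance scores are invented for illustration; note how the two score lists rise and fall together, which is exactly the correlation the paired test exploits.

```python
import random

random.seed(42)

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05):
    """CI on mean(score_a - score_b) via the paired bootstrap."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-instance differences with replacement, n_boot times,
    # and record the mean of each resample.
    means = sorted(
        sum(random.choice(diffs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up per-instance scores: both methods struggle on the same hard
# examples, so the scores are correlated.
a = [0.90, 0.80, 0.30, 0.70, 0.95, 0.40, 0.85, 0.60, 0.75, 0.50]
b = [0.85, 0.75, 0.35, 0.65, 0.90, 0.35, 0.80, 0.55, 0.70, 0.45]

lo, hi = paired_bootstrap_ci(a, b)
print(f"95% CI on the difference: [{lo:.3f}, {hi:.3f}]")
# If the interval excludes zero, the difference is significant at the 5% level.
```

With real evaluation output, `scores_a` and `scores_b` would be the per-instance metric values (e.g., per-sentence BLEU or per-user Hit@10) for the two methods on the same test set.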

Question 14

You encounter a recommendation paper that reports results on three datasets but omits results on two other standard benchmarks for the task. What pitfall might be at play, and how would you investigate?

Answer

The most likely pitfall is **cherry-picked results** (selective dataset reporting). The paper may have evaluated on all five datasets but reported only the three where the method performed favorably. To investigate: (1) Check the paper's appendix or supplementary material — omitted results are sometimes reported there. (2) Check the paper's code repository (if available) for configuration files or scripts for the omitted datasets — their existence suggests the experiments were run. (3) Check the paper's earlier arXiv versions — sometimes results are included in drafts and removed in revisions. (4) Attempt reproduction on the omitted datasets yourself. (5) Check whether the omitted benchmarks have properties (different data distribution, different sparsity level, different domain) that might disadvantage the proposed method. If the omitted benchmarks are smaller, easier, or from a different domain, the omission is less concerning; if they are standard benchmarks in the exact same task, the omission is a significant red flag.

Question 15

A causal inference paper reports an ATE estimate but includes no sensitivity analysis. Why is this a red flag?

Answer

Every causal estimate from observational data depends on the **no unmeasured confounding** assumption — the assumption that all variables that affect both the treatment and the outcome have been observed and controlled for. This assumption is untestable: there is no statistical test that can confirm the absence of unmeasured confounders. A **sensitivity analysis** quantifies how robust the estimate is to violations of this assumption — specifically, how strong an unmeasured confounder would need to be to reduce the estimated effect to zero or change its sign. Without a sensitivity analysis, the reader cannot assess whether the causal claim is robust or fragile. An ATE of 5.0 that would be nullified by a confounder explaining just 2% of outcome variance is much less convincing than an ATE of 5.0 that remains positive even if an unmeasured confounder explains 30% of outcome variance. The sensitivity analysis (e.g., Ding and VanderWeele, 2016; the E-value) provides this information. Its absence means the paper is presenting a point estimate without any assessment of its credibility under plausible violations of its key assumption.
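
The E-value mentioned above has a simple closed form for an observed risk ratio RR: E = RR + sqrt(RR * (RR - 1)). A minimal sketch, with illustrative inputs:

```python
import math

def e_value(rr):
    """Minimum strength of association (on the risk-ratio scale) that an
    unmeasured confounder would need with both treatment and outcome to
    fully explain away an observed risk ratio rr."""
    if rr < 1:
        rr = 1 / rr  # the formula is applied symmetrically to protective effects
    return rr + math.sqrt(rr * (rr - 1))

# Illustrative values: a larger observed effect requires a much stronger
# hidden confounder to explain it away.
for rr in (1.1, 1.5, 3.0):
    print(f"RR = {rr} -> E-value = {e_value(rr):.2f}")
```

Reading the output: an observed RR of 1.5 is explained away by a confounder associated with both treatment and outcome at RR about 2.4, whereas an observed RR of 3.0 would require a confounder of strength about 5.4, a far more demanding (and hence less plausible) violation.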

Question 16

What does it mean for a result to be "statistically significant but not practically significant"? Give an example.

Answer

**Statistical significance** means the measured difference is unlikely to be zero — the evidence is strong enough to reject the null hypothesis. **Practical significance** means the measured difference is large enough to matter for a real-world decision. These are independent: a result can be statistically significant (reliably non-zero) but practically insignificant (too small to change any decision). Example: a recommendation system A/B test with 10 million users detects a statistically significant improvement of 0.02% in click-through rate (p < 0.001). The result is real — with 10 million users, the test has enough power to detect tiny effects. But 0.02% CTR improvement translates to approximately 2,000 additional clicks per day across 10 million users, which may not justify the engineering cost of maintaining a more complex model. The question is always: "If this improvement is real, would it change any decision?" If the answer is no, statistical significance is irrelevant.

Question 17

Why should you expect paper results to degrade by 20-40% when transferred to production?

Answer

Paper results are produced under idealized conditions that systematically favor higher performance: **(1) Clean data** — benchmark datasets are curated; production data has missing values, label noise, and inconsistencies. **(2) Static evaluation** — benchmarks are fixed; production data undergoes distribution shift over time. **(3) Unlimited tuning** — papers tune hyperparameters extensively on the benchmark; production systems operate under tighter tuning budgets. **(4) Matched distribution** — training and test data in benchmarks come from the same distribution; production data may differ from any benchmark. **(5) No operational constraints** — papers do not face latency requirements, cost budgets, or model size limitations. Each of these factors contributes a fraction of the degradation, and they compound. The 20-40% range is an empirical heuristic, not a theorem — the actual degradation depends on the specific gaps between paper conditions and production conditions. Some methods transfer well (< 10% degradation); others fail completely (> 50% degradation). The heuristic provides a reasonable default expectation.

Question 18

What is the difference between the role of NeurIPS/ICML/ICLR (ML conferences) and JASA/Biometrika (statistics journals) in the causal inference literature?

Answer

**ML conferences (NeurIPS, ICML, ICLR, UAI)** publish causal inference work that emphasizes scalable algorithms, connections to machine learning (e.g., causal forests, causal representation learning, causal discovery from observational data at scale), and empirical evaluation on large datasets, often simulated. The review cycle is fast (3-6 months) and the standard is primarily empirical. **Statistics journals (JASA, Biometrika, Annals of Statistics, JRSS-B)** publish causal inference work that emphasizes theoretical properties (consistency, asymptotic normality, efficiency bounds, semiparametric theory), identification conditions, and careful treatment of assumptions. The review cycle is slow (12-24 months) and the standard is primarily theoretical. For practitioners, the ML venues provide scalable implementations, while the statistics journals provide the theoretical foundations that tell you when those implementations are valid. Reading only ML venues risks implementing methods without understanding their assumptions; reading only statistics journals risks theoretical knowledge without practical implementation guidance.

Question 19

You have completed a third-pass reading of a paper and decide to implement it. What are the four steps in the reading-to-implementation pipeline, and approximately how long does each take?

Answer

**(1) Reproduce the paper's key result (1-2 weeks).** Run the paper's code on the paper's dataset (or implement from scratch if no code is available) and verify that the result matches within 1-2% of the reported performance across at least three random seeds. If reproduction fails, stop — an irreproducible result is not a foundation for production engineering. **(2) Evaluate on your data (1-2 weeks).** Run the reproduced method on your production data or a representative sample. Expect degradation; the question is how much. Less than 10% degradation suggests good transferability; 10-30% suggests partial transfer requiring adaptation; more than 30% suggests the method does not transfer. **(3) Production prototype (2-4 weeks).** Implement the method within your production infrastructure — feature store integration, latency optimization, monitoring integration, deployment pipeline. This is where the infrastructure gap becomes concrete. **(4) A/B test (2-4 weeks).** Run a controlled experiment with pre-registered primary metric, CUPED variance reduction, sequential testing, and SRM checks. Only deploy if the A/B test confirms a meaningful improvement. Total pipeline: 6-12 weeks from paper reading to production deployment.

Question 20

The chapter states that "the purpose of reading papers is not to implement every one — it is to develop the judgment to identify the rare papers worth implementing." Explain why this judgment is more valuable than any individual implementation.

Answer

This judgment — sometimes called "research taste" — is more valuable than any individual implementation for three reasons. **(1) It prevents wasted effort.** The reading-to-implementation pipeline takes 6-12 weeks. A team that implements every promising paper will spend most of its time on methods that ultimately do not improve the production system. The judgment to *not* implement a paper — to recognize that its claims are inflated, its assumptions are violated in your context, or its improvement is too small to justify the effort — saves those 6-12 weeks for work that actually matters. **(2) It compounds over time.** Each paper you evaluate, whether you implement it or not, sharpens your ability to evaluate the next one. After evaluating 100 papers, you develop intuition for the patterns that predict genuine advances versus incremental or misleading results. This intuition cannot be taught directly — it must be developed through practice. **(3) It transfers across problems.** The ability to evaluate methodology, identify pitfalls, and assess production relevance applies to every paper in every subfield. An individual implementation helps with one specific problem; research judgment helps with every problem you will encounter throughout your career.