Chapter 37: Exercises

Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field


Three-Pass Reading Strategy

Exercise 37.1 (*)

Select a paper from the most recent proceedings of RecSys, KDD, or NeurIPS. Apply the first-pass reading strategy (5-10 minutes). Fill in a FirstPassAssessment dataclass instance for the paper, including all five classification fields: relevance, main contribution, methodology type, flags, and decision.

(a) Write your first-pass assessment.

(b) Without reading further, predict whether the paper's core result would transfer to the StreamRec production context (50M MAU, 200ms latency, 200K items). State your prediction and your confidence level (low/medium/high).

(c) Now perform a second pass. Did the second pass change your prediction from part (b)? If so, what information from the second pass was decisive?
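Exercise 37.1 assumes the FirstPassAssessment dataclass defined in the chapter. If you are working from this section alone, a minimal sketch of the shape assumed here might look like the following; the field names follow the five classification fields named in the exercise, but the chapter's exact definition may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FirstPassAssessment:
    """Structured notes from a 5-10 minute first pass.

    Illustrative sketch only; use the chapter's actual definition
    if you have it.
    """
    relevance: str           # e.g. "core", "adjacent", "unrelated"
    main_contribution: str   # one-sentence summary of the claimed contribution
    methodology_type: str    # "theoretical", "empirical", or "both"
    flags: List[str] = field(default_factory=list)  # red flags noticed
    decision: str = "skip"   # "skip", "second-pass", or "third-pass"


# Example usage:
assessment = FirstPassAssessment(
    relevance="core",
    main_contribution="Contrastive pretraining for session-based recommendation.",
    methodology_type="empirical",
    flags=["no confidence intervals"],
    decision="second-pass",
)
```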


Exercise 37.2 (*)

Apply the three-pass strategy to Keshav's "How to Read a Paper" (2007) itself. Yes, this is meta — reading the paper about reading papers using the method it describes. The point is to practice the mechanics on a short, accessible paper before applying them to dense technical work.

(a) First pass (5 minutes). What is the main contribution? What is the methodology type (theoretical, empirical, both)? What red flags, if any, do you observe?

(b) Second pass (15 minutes). In your own words, what is the key insight that makes the three-pass strategy more efficient than reading a paper linearly from start to finish?

(c) Is there any experiment, comparison, or validation in the paper? If not, does this weakness invalidate the paper's contribution? Why or why not?


Exercise 37.3 (**)

Select three papers from the same subfield (e.g., three papers on session-based recommendation, or three papers on causal forest estimation). Perform a first pass on all three. Then, using only your first-pass notes, rank the three papers by expected methodological quality.

Perform a second pass on all three. Re-rank them by actual methodological quality.

(a) How well did your first-pass ranking predict your second-pass ranking?

(b) What signals in the first pass were most predictive of methodological quality?

(c) What signals in the first pass were misleading?


Exercise 37.4 (*)

Find a blog post or Twitter/X thread that summarizes a recent ML paper. Read the blog post first, then read the actual paper using the three-pass strategy.

(a) List three claims from the blog post. For each, identify whether the claim is (i) accurately represented in the paper, (ii) overstated relative to the paper's evidence, or (iii) absent from the paper entirely.

(b) What limitations discussed in the paper were omitted from the blog post?

(c) Write a two-paragraph summary of the paper that you consider more accurate than the blog post.


Evaluating Methodology

Exercise 37.5 (**)

The following table presents results from a hypothetical recommendation system paper. Identify all methodological concerns.

Method                       Hit@10    NDCG@10    Recall@50
Pop (popularity baseline)    0.041     0.019      0.087
MF (matrix factorization)    0.112     0.054      0.198
GRU4Rec (2016)               0.134     0.062      0.231
SASRec (2018)                0.148     0.071      0.256
OurMethod                    0.167     0.081      0.289

(a) The paper was published in 2024. What is the most obvious concern about the baselines?

(b) The paper reports single numbers with no confidence intervals or standard deviations. What is the concern?

(c) The paper evaluates on MovieLens-1M only. What is the concern?

(d) If you were a reviewer, what additional experiments would you require before recommending acceptance?


Exercise 37.6 (**)

A paper claims that their new transformer variant improves BLEU on WMT 2014 En-De by 0.3 points over the standard Transformer (from 28.4 to 28.7). The paper uses a single training run.

(a) Vaswani et al. (2017) did not report standard deviations. Based on subsequent reproduction studies, the standard deviation of BLEU across random seeds for Transformer-base on WMT En-De is approximately 0.2-0.4 BLEU points. Given this, is the claimed 0.3-point improvement statistically meaningful?

(b) Write a paired_bootstrap_test call that would properly test this claim, assuming per-sentence BLEU scores are available for both models on the same test set.

(c) What is the minimum number of random seeds the paper should have used to distinguish a 0.3-point improvement from noise, given the estimated variance?
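For part (b), the function name paired_bootstrap_test comes from the exercise; its signature is not given, so the sketch below is one plausible shape, not the chapter's definition. Note one simplification: BLEU is a corpus-level metric, so averaging per-sentence BLEU (as here) only approximates the corpus score.

```python
import numpy as np


def paired_bootstrap_test(
    scores_a: np.ndarray,
    scores_b: np.ndarray,
    n_resamples: int = 5_000,
    seed: int = 0,
) -> float:
    """Paired bootstrap test on per-sentence scores from the same test set.

    Resamples sentence indices with replacement and returns the fraction
    of resamples in which system B does NOT outperform system A -- an
    estimate of the p-value for the claim "B beats A".
    """
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample sentences, paired
        if scores_b[idx].mean() > scores_a[idx].mean():
            wins += 1
    return 1.0 - wins / n_resamples


# Hypothetical per-sentence BLEU arrays standing in for the two systems:
rng = np.random.default_rng(42)
baseline = rng.normal(28.4, 10.0, size=3000).clip(0.0, 100.0)
variant = baseline + rng.normal(0.3, 10.0, size=3000)
p_value = paired_bootstrap_test(baseline, variant)
```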


Exercise 37.7 (*)

Define "ablation study" in your own words. Then answer: why is an ablation study often more informative about a method's design than a comparison against external baselines alone?


Exercise 37.8 (**)

A paper proposes a recommendation method with four components: (A) a graph encoder, (B) a contrastive loss, (C) a temporal positional encoding, and (D) a diversity re-ranking layer. The paper reports the following ablation:

Configuration               Hit@20
Full method (A+B+C+D)       0.231
Remove A (B+C+D)            0.195
Remove B (A+C+D)            0.204
Remove C (A+B+D)            0.227
Remove D (A+B+C)            0.229

(a) Rank the four components by their marginal contribution to performance.

(b) Component C contributes 0.004 Hit@20 and Component D contributes 0.002 Hit@20. If implementing the full method requires 8 weeks of engineering and implementing A+B only requires 3 weeks, what is the practical recommendation?

(c) The ablation does not test interaction effects (e.g., what happens when both A and B are removed). Design an ablation table that would test the interaction between A and B.
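The marginal contributions in part (a) can be read directly off the table; a short computation with the table values hard-coded makes the ranking mechanical and is a useful starting point for the interaction table in part (c):

```python
full = 0.231
ablations = {  # Hit@20 when the named component is removed
    "A (graph encoder)": 0.195,
    "B (contrastive loss)": 0.204,
    "C (temporal positional encoding)": 0.227,
    "D (diversity re-ranking)": 0.229,
}

# Marginal contribution of a component = drop in Hit@20 when it is removed.
marginal = {name: round(full - score, 3) for name, score in ablations.items()}
ranked = sorted(marginal.items(), key=lambda kv: kv[1], reverse=True)
```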


Exercise 37.9 (***)

Using the BaselineAudit dataclass from Section 37.3, audit the baselines of any paper from the most recent RecSys proceedings. Compute the fairness_score().

(a) Report the score and explain which dimensions contributed most to lowering or raising it.

(b) Identify at least one baseline that the paper should have included but did not.

(c) If you were to add that baseline, what would you expect the results table to look like? Justify your prediction.


Identifying Pitfalls

Exercise 37.10 (*)

For each of the following scenarios, identify which pitfall (dataset leakage, unfair baselines, cherry-picking, p-hacking, or publication bias) is most likely at play.

(a) A fraud detection model achieves 0.997 AUC on the test set, while the best published result on the same dataset is 0.92 AUC.

(b) A paper reports results on Yelp and Amazon Reviews but not on MovieLens or Gowalla, despite these being standard benchmarks for the paper's task.

(c) A paper proposes a new regularization technique and compares it against an unregularized baseline, a baseline with L2 regularization (with default lambda), and their proposed method (with lambda tuned via Bayesian optimization).

(d) A researcher tries 50 feature combinations and reports the one that produces a significant result (p < 0.05).

(e) You search for papers on a specific method and find seven papers reporting positive results. No papers report negative results.


Exercise 37.11 (**)

A clinical trial paper reports that a drug reduces hospital readmission rates from 18% to 14% (p = 0.04). The study has 200 patients per arm.

(a) Calculate the 95% confidence interval for the absolute risk reduction.

(b) The paper reports three secondary outcomes alongside the primary outcome. What statistical concern does this raise?

(c) How would you determine whether the primary outcome was pre-registered or chosen post-hoc?
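A starter for part (a): the Wald (normal-approximation) interval for a difference in two proportions, with the exercise's numbers hard-coded. Whether the resulting interval is consistent with the reported p = 0.04 is part of what you should check.

```python
import math

p_control, p_treat, n = 0.18, 0.14, 200  # readmission rates, patients per arm

arr = p_control - p_treat  # absolute risk reduction
se = math.sqrt(
    p_control * (1 - p_control) / n + p_treat * (1 - p_treat) / n
)
lower, upper = arr - 1.96 * se, arr + 1.96 * se
```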


Exercise 37.12 (**)

Design a LeakageDetector class that checks a train/test pair of pandas DataFrames for common forms of dataset leakage. At minimum, implement checks for:

(a) Temporal leakage: features that are chronologically later than the target variable.

(b) Group leakage: entities (e.g., user IDs) that appear in both train and test splits.

(c) Target leakage: features that are derived from the target variable.

from dataclasses import dataclass, field
from typing import List, Optional
import pandas as pd
import numpy as np


@dataclass
class LeakageReport:
    """Report from a leakage detection check.

    Attributes:
        check_name: Name of the leakage check.
        leakage_detected: Whether leakage was detected.
        details: Human-readable description of findings.
        affected_features: List of features involved in leakage.
        severity: Severity level ('low', 'medium', 'high').
    """
    check_name: str
    leakage_detected: bool
    details: str
    affected_features: List[str] = field(default_factory=list)
    severity: str = "low"


class LeakageDetector:
    """Detects common forms of dataset leakage.

    Checks for temporal leakage, group leakage, and
    suspiciously high feature-target correlations that
    may indicate target leakage.

    Attributes:
        df_train: Training DataFrame.
        df_test: Test DataFrame.
        target_col: Name of the target column.
        timestamp_col: Name of the timestamp column (if applicable).
        entity_col: Name of the entity/group column (if applicable).
    """

    def __init__(
        self,
        df_train: pd.DataFrame,
        df_test: pd.DataFrame,
        target_col: str,
        timestamp_col: Optional[str] = None,
        entity_col: Optional[str] = None,
    ):
        self.df_train = df_train
        self.df_test = df_test
        self.target_col = target_col
        self.timestamp_col = timestamp_col
        self.entity_col = entity_col

    def check_temporal_leakage(self) -> LeakageReport:
        """Check if training data contains timestamps after test data.

        Returns:
            LeakageReport with temporal leakage findings.
        """
        if self.timestamp_col is None:
            return LeakageReport(
                check_name="temporal_leakage",
                leakage_detected=False,
                details="No timestamp column specified; skipping check.",
            )

        train_max = self.df_train[self.timestamp_col].max()
        test_min = self.df_test[self.timestamp_col].min()

        if train_max > test_min:
            overlap_count = (
                self.df_train[self.timestamp_col] > test_min
            ).sum()
            return LeakageReport(
                check_name="temporal_leakage",
                leakage_detected=True,
                details=(
                    f"Training data contains {overlap_count} rows with "
                    f"timestamps after the earliest test timestamp "
                    f"({test_min}). Train max: {train_max}."
                ),
                affected_features=[self.timestamp_col],
                severity="high",
            )

        return LeakageReport(
            check_name="temporal_leakage",
            leakage_detected=False,
            details="No temporal overlap detected.",
        )

    def check_group_leakage(self) -> LeakageReport:
        """Check if entities appear in both train and test sets.

        Returns:
            LeakageReport with group leakage findings.
        """
        if self.entity_col is None:
            return LeakageReport(
                check_name="group_leakage",
                leakage_detected=False,
                details="No entity column specified; skipping check.",
            )

        train_entities = set(self.df_train[self.entity_col].unique())
        test_entities = set(self.df_test[self.entity_col].unique())
        overlap = train_entities & test_entities

        if overlap:
            overlap_fraction = len(overlap) / len(test_entities)
            return LeakageReport(
                check_name="group_leakage",
                leakage_detected=True,
                details=(
                    f"{len(overlap)} entities ({overlap_fraction:.1%} of "
                    f"test entities) appear in both train and test sets."
                ),
                affected_features=[self.entity_col],
                severity="high" if overlap_fraction > 0.5 else "medium",
            )

        return LeakageReport(
            check_name="group_leakage",
            leakage_detected=False,
            details="No entity overlap between train and test.",
        )

    def check_target_correlation(
        self,
        threshold: float = 0.95,
    ) -> LeakageReport:
        """Check for features with suspiciously high target correlation.

        A feature with correlation > threshold to the target may
        be derived from the target (target leakage).

        Args:
            threshold: Correlation threshold for flagging.

        Returns:
            LeakageReport with target correlation findings.
        """
        numeric_cols = self.df_train.select_dtypes(include=[np.number]).columns
        numeric_cols = [c for c in numeric_cols if c != self.target_col]

        suspicious = []
        for col in numeric_cols:
            corr = self.df_train[col].corr(self.df_train[self.target_col])
            if abs(corr) > threshold:
                suspicious.append((col, corr))

        if suspicious:
            details_parts = [
                f"{col}: r={corr:.3f}" for col, corr in suspicious
            ]
            return LeakageReport(
                check_name="target_correlation",
                leakage_detected=True,
                details=(
                    f"{len(suspicious)} feature(s) with |correlation| > "
                    f"{threshold}: {'; '.join(details_parts)}"
                ),
                affected_features=[col for col, _ in suspicious],
                severity="high",
            )

        return LeakageReport(
            check_name="target_correlation",
            leakage_detected=False,
            details=f"No features with |correlation| > {threshold}.",
        )

    def run_all_checks(self) -> List[LeakageReport]:
        """Run all leakage checks and return reports.

        Returns:
            List of LeakageReport instances.
        """
        return [
            self.check_temporal_leakage(),
            self.check_group_leakage(),
            self.check_target_correlation(),
        ]

Test your LeakageDetector on a synthetic dataset that you construct with intentional leakage in each category.
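One way to start on that test is to build a small synthetic dataset with leakage planted in each category. The sketch below checks the three conditions directly with pandas so it stands alone; in your solution, swap the final three lines for calls to your LeakageDetector.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

train = pd.DataFrame({
    "user_id": rng.integers(0, 50, n),
    "ts": rng.integers(0, 100, n),
    "x": rng.normal(size=n),
})
test = pd.DataFrame({
    "user_id": rng.integers(0, 50, n),
    "ts": rng.integers(100, 200, n),
    "x": rng.normal(size=n),
})
train["y"] = (train["x"] > 0).astype(int)
# Target leakage: a feature constructed directly from the target.
train["leaky"] = train["y"] + rng.normal(0.0, 0.01, n)

# Plant the other two leaks explicitly so the example is deterministic:
train.loc[0, "ts"] = 150                             # temporal leakage
test.loc[0, "ts"] = 100
test.loc[0, "user_id"] = int(train.loc[0, "user_id"])  # group leakage

# The three conditions the detector should flag:
temporal = train["ts"].max() > test["ts"].min()
group = bool(set(train["user_id"]) & set(test["user_id"]))
target = abs(train["leaky"].corr(train["y"])) > 0.95
```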


Exercise 37.13 (*)

Explain the difference between replication and reproduction in the context of ML research. Which is more common, and why?


Exercise 37.14 (**)

Raff (2019) found that 63.5% of ML papers could be independently reproduced. Suppose you are evaluating a paper without released code that claims a 5% improvement over the previous state of the art on a standard benchmark.

(a) Using Raff's reproduction rate as a prior, what is your prior probability that the result is reproducible?

(b) Now suppose the paper includes a detailed hyperparameter table, specifies library versions, and describes the evaluation protocol in an appendix. Raff found that these factors increase reproducibility to approximately 80%. Update your estimate.

(c) Now suppose the paper is from a group that has released code for their last five papers, all of which reproduced. Update your estimate again.
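Parts (a)-(c) are exercises in updating a probability as evidence arrives. The odds form of Bayes' rule keeps the arithmetic simple; the helper below is illustrative, not from the chapter.

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Update a probability via the odds form of Bayes' rule.

    likelihood_ratio = P(evidence | reproducible) / P(evidence | not).
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# With LR = 1 the evidence is uninformative and the prior is unchanged:
unchanged = bayes_update(0.635, 1.0)
# An LR of ~2.3 moves the Raff prior of 0.635 to roughly 0.80,
# consistent with the figure quoted in part (b):
updated = bayes_update(0.635, 2.3)
```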


Building Reading Practice

Exercise 37.15 (*)

Create an AnnotatedBibEntry for a paper you have read in the last month. Fill in all fields. Time yourself — the goal is to complete the entry in under 10 minutes.


Exercise 37.16 (**)

Design a reading schedule for the next four weeks. Identify:

(a) Three arXiv categories or conference proceedings you will monitor.

(b) The number of first-pass, second-pass, and third-pass papers you plan to read each week.

(c) One paper you intend to third-pass read. What makes this paper worth the 2-5 hour investment?

(d) Whether you will start or join a paper reading group. If yes, propose a format (frequency, presentation length, discussion structure).


Exercise 37.17 (*)

Explain what "research taste" means in the context of this chapter. Give one example of a paper that demonstrated good research taste (identified an important problem or provided a surprisingly simple solution) and one example of a paper that represents incremental work (marginal improvement on existing methods).


Paper-to-Production Gap

Exercise 37.18 (**)

Select any paper that proposes a new recommendation algorithm. Using the PaperToProductionAssessment dataclass from Section 37.6, evaluate the paper for the StreamRec production context.

(a) Identify the data gap, scale gap, evaluation gap, infrastructure gap, and maintenance gap.

(b) Estimate the total implementation effort in engineering weeks.

(c) Compute the effort_adjusted_value() summary. Would you recommend implementation, prototyping, or skipping?


Exercise 37.19 (***)

A paper reports that a new training method reduces model training time by 40% with no accuracy loss. You want to adopt this method for StreamRec's weekly retraining pipeline (currently 8 GPU-hours on 4 A100s).

(a) What is the expected time saving per training run? What is the annual cost saving, assuming weekly retraining and $3.50/GPU-hour for A100 on-demand pricing?

(b) The paper evaluates on ImageNet classification. StreamRec uses a two-tower retrieval model with contrastive loss. List three reasons the training time reduction might not transfer.

(c) Design a two-week prototype plan to evaluate whether the method works for StreamRec. What is your success criterion?

(d) If the prototype succeeds, what additional steps are needed before production deployment? Estimate the timeline.
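For part (a), the cost arithmetic is a one-liner once you resolve the ambiguity in "8 GPU-hours on 4 A100s" (total accelerator time, or 8 wall-clock hours across 4 GPUs?). The sketch below treats both readings as parameter choices; which interpretation to defend is part of the exercise.

```python
def annual_saving(
    gpu_hours_per_run: float,
    reduction: float = 0.40,          # claimed training-time reduction
    runs_per_year: int = 52,          # weekly retraining
    dollars_per_gpu_hour: float = 3.50,
) -> float:
    """Annual dollar saving from cutting per-run training time by `reduction`."""
    return gpu_hours_per_run * reduction * runs_per_year * dollars_per_gpu_hour


# Reading 1: "8 GPU-hours" is the total accelerator time per run.
saving_if_8 = annual_saving(8.0)
# Reading 2: 8 wall-clock hours on 4 GPUs = 32 GPU-hours per run.
saving_if_32 = annual_saving(32.0)
```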


Exercise 37.20 (**)

You read a paper proposing a new fairness intervention for recommendation systems. The paper shows that their method reduces exposure disparity by 35% with only a 2% decrease in NDCG@20.

(a) Using the fairness framework from Chapter 31, what additional fairness metrics would you need to evaluate before adopting this method?

(b) The paper evaluates on a static dataset. Why is this insufficient for evaluating a fairness intervention in production?

(c) Design an A/B test (using the methodology from Chapter 33) that would validate this fairness intervention in the StreamRec production system. Specify the primary metric, guardrail metrics, and sample size considerations.
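For the sample-size considerations in part (c), the standard normal-approximation formula for a two-sample test on a continuous metric is a reasonable starting point. The sigma and MDE values below are placeholders; in practice both must be estimated from StreamRec data.

```python
import math
from statistics import NormalDist


def n_per_arm(
    sigma: float,
    mde: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Per-arm sample size: n = 2 * (z_{1-alpha/2} + z_power)^2 * sigma^2 / mde^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)


# Placeholder numbers: detect an absolute NDCG@20 change of 0.004
# assuming a per-user standard deviation of 0.25.
example = n_per_arm(sigma=0.25, mde=0.004)
```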


Advanced / Research-Level

Exercise 37.21 (***)

Implement a PaperComparisonReview for two papers on the same topic. Choose two papers on causal effect estimation (e.g., one using doubly robust estimation and one using causal forests) or two papers on session-based recommendation (e.g., one using transformers and one using graph neural networks).

Complete all five evaluation components for each paper (first-pass, second-pass notes, pitfall audit, baseline audit, production assessment). Then write the comparative analysis (500-1000 words).


Exercise 37.22 (***)

A colleague proposes implementing a paper that claims to improve click-through rate prediction by 8% using a complex feature interaction network. The paper evaluates on Criteo and Avazu datasets.

(a) The Criteo dataset has known issues (class imbalance, feature anonymization). How do these issues affect the transferability of the reported 8% improvement?

(b) The paper compares against DeepFM, DCN, and AutoInt. Are these baselines sufficient for a paper published in 2024? What baselines are missing?

(c) Write a one-page memo to your team lead recommending for or against implementation. Use the paper-to-production framework from Section 37.6. Include estimated effort, expected production lift, and risk factors.


Exercise 37.23 (****)

The reproducibility crisis in ML has been documented by Raff (2019), Dodge et al. (2019), and Bouthillier et al. (2021). Design a reproducibility audit protocol for your team.

(a) Define "reproducible" in your production context. What tolerance do you accept for metric differences?

(b) Design a checklist that every paper must pass before your team invests engineering time in implementation. The checklist should cover: code availability, data availability, hyperparameter disclosure, compute disclosure, statistical significance, and evaluation protocol.

(c) Estimate the cost (in person-hours) of applying this checklist to a single paper. Is the cost justified by the reduction in wasted engineering effort?


Exercise 37.24 (****)

Meta-analysis is a statistical method for combining results across multiple studies. It is standard in medicine and social science but rare in machine learning.

(a) Explain why meta-analysis is difficult in ML. Consider: different datasets, different evaluation protocols, different hyperparameter tuning procedures, and different reporting standards.

(b) Despite these difficulties, some ML meta-analyses exist (e.g., Rendle et al., 2020, "Neural Collaborative Filtering vs. Matrix Factorization Revisited"). Read Rendle et al. and summarize their finding. How did it change the field's understanding of neural collaborative filtering?

(c) Propose a meta-analysis for a topic of your choice (e.g., "Do graph neural networks improve recommendation over non-graph methods?"). What papers would you include? What would be your primary outcome measure? How would you handle heterogeneity across studies?


Exercise 37.25 (****)

Publication bias means the published literature is a biased sample of all experiments conducted. Propose a mechanism for your organization to benefit from negative results — experiments that showed a method did not work.

(a) Design an internal "negative results registry" — a structured database of methods your team tried that did not produce the expected improvement. What fields should each entry contain?

(b) How would you incentivize engineers to contribute to this registry? (Consider: most organizations reward positive results; contributing negative results requires changing incentive structures.)

(c) Estimate the value of this registry over one year. How many engineering weeks might it save by preventing duplicate negative experiments?


Progressive Project

Exercise 37.26 (***)

Complete the progressive project milestone described in Section 37.12. Select two recommendation system papers from the last three years. For each paper, produce the five structured evaluations (first-pass, second-pass, pitfall audit, baseline audit, production assessment). Then write the comparative analysis.

This exercise is the chapter's primary deliverable. Budget 6-10 hours.


Exercise 37.27 (**)

After completing Exercise 37.26, reflect on your reading process.

(a) How long did each pass take for each paper? Was this consistent with the time estimates in Section 37.2?

(b) Which evaluation dimension (baselines, ablation, significance, reproducibility, production gap) was most useful for discriminating between the two papers?

(c) Would your recommendation change if the StreamRec latency budget were 500ms instead of 200ms? If the team had 2 ML engineers instead of 5?


Exercise 37.28 (**)

Create annotated bibliography entries (AnnotatedBibEntry) for both papers from Exercise 37.26 and for at least three of the papers cited in the Further Reading section of this chapter. Store them in a structured format (JSON, YAML, or a dataclass-based Python file) that you can search and filter.
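For the JSON option, dataclasses serialize cleanly via asdict, and filtering on reload is a list comprehension. The AnnotatedBibEntry fields below are an illustrative sketch; substitute the chapter's actual definition.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AnnotatedBibEntry:
    """Illustrative shape only; use the chapter's actual definition."""
    title: str
    year: int
    venue: str
    key_idea: str
    tags: List[str] = field(default_factory=list)


entries = [
    AnnotatedBibEntry("How to Read a Paper", 2007, "SIGCOMM CCR",
                      "Three-pass reading strategy", ["methodology"]),
]

# Serialize to JSON for storage...
blob = json.dumps([asdict(e) for e in entries], indent=2)

# ...and filter on reload.
loaded = json.loads(blob)
methodology = [e for e in loaded if "methodology" in e["tags"]]
```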


Exercise 37.29 (***)

You have completed the critical review from Exercise 37.26 and recommended one paper for implementation. Write a 1-page implementation proposal that includes:

(a) The core insight to implement (which may be a subset of the full paper's method).

(b) A reproduction plan (1-2 weeks) with specific success criteria.

(c) A production prototype plan (2-4 weeks) with infrastructure requirements.

(d) An A/B test plan with primary metric, guardrail metrics, MDE, and required sample size.

(e) A rollback plan if the A/B test fails.


Exercise 37.30 (**)

The chapter argues that "most papers you read will not be implemented" and that "the decision not to implement a paper is as valuable as the decision to implement one." Write a one-paragraph argument for why this is true, drawing on your experience from this chapter's exercises. Then write a one-paragraph counterargument — a scenario where a team reads too many papers and implements too few.