Case Study 2: Paper Walkthrough — "DoWhy: An End-to-End Library for Causal Inference" Using the Three-Pass Method
Context
Sharma and Kiciman's "DoWhy: An End-to-End Library for Causal Inference" (arXiv preprint, 2020; later published in JMLR 2025) describes a Python library that operationalizes the causal inference workflow: model the causal assumptions, identify the target estimand, estimate the causal effect, and refute the estimate through robustness checks. This paper sits at the intersection of causal inference (Part III of this book) and production systems (Part V), making it directly relevant to the MediCore pharma anchor example and to the causal evaluation of StreamRec.
Unlike the "Attention Is All You Need" walkthrough (Case Study 1), which evaluated a methods paper, this case study evaluates a systems/library paper — a different genre with different evaluation criteria. Library papers are judged less by novel methodology and more by whether the library makes a well-defined workflow more rigorous, accessible, and reproducible.
Pass 1: The Survey Pass (6 minutes)
Title and abstract. The title names the library. The abstract claims that DoWhy implements a four-step causal inference workflow: (1) model assumptions as a causal graph, (2) identify the estimand using graph-based criteria, (3) estimate the effect using any of several methods, and (4) refute the estimate with sensitivity analysis and placebo tests. The abstract emphasizes that the library makes causal assumptions explicit — users must specify the graph, which forces them to articulate what they believe about the data-generating process.
Section headings. Introduction, Background (Potential Outcomes and Causal Graphs), The DoWhy Workflow, Identification, Estimation, Refutation, Example, Related Work, Conclusion. The structure follows the library's workflow, not a traditional methods-paper structure.
Figures and tables. Figure 1 shows the four-step workflow diagram. Figure 2 shows a causal graph example. No main results table comparing against baselines — this is expected for a library paper that is not claiming a novel estimation method.
Conclusion. Discusses the library's design philosophy (make assumptions explicit, provide multiple estimation methods, always refute) and future directions (integration with EconML for heterogeneous treatment effects, support for time-series causal inference).
References. Cites the foundational causal inference literature: Pearl (2009), Rubin (1974), Imbens and Rubin (2015), Robins et al. (1994), VanderWeele and Ding (2017) for sensitivity analysis. The reference list is appropriate and comprehensive.
First-pass assessment:
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PaperRelevance(Enum):
    HIGHLY_RELEVANT = "highly_relevant"
    POTENTIALLY_RELEVANT = "potentially_relevant"
    NOT_RELEVANT = "not_relevant"
    CANNOT_DETERMINE = "need_second_pass"


class MethodologyFlag(Enum):
    NO_ABLATION = "no_ablation_study"
    WEAK_BASELINES = "weak_or_outdated_baselines"
    NO_SIGNIFICANCE = "no_statistical_significance_reported"
    SINGLE_DATASET = "single_dataset_evaluation"
    NO_CODE = "no_code_or_reproduction_details"
    NONE = "no_flags"


@dataclass
class FirstPassAssessment:
    """Structured first-pass assessment."""
    title: str
    authors: List[str]
    venue: str
    year: int
    category: str
    relevance: PaperRelevance
    main_contribution: str
    flags: List[MethodologyFlag] = field(default_factory=list)
    decision: str = "skip"
    notes: str = ""


dowhy_pass1 = FirstPassAssessment(
    title="DoWhy: An End-to-End Library for Causal Inference",
    authors=["Sharma", "Kiciman"],
    venue="arXiv / JMLR",
    year=2020,
    category="causal inference / software",
    relevance=PaperRelevance.HIGHLY_RELEVANT,
    main_contribution=(
        "A Python library that operationalizes the full causal inference "
        "workflow — modeling, identification, estimation, and refutation — "
        "with explicit causal assumptions and built-in robustness checks."
    ),
    flags=[MethodologyFlag.NONE],
    decision="second_pass",
    notes=(
        "Library paper, not methods paper — different evaluation criteria. "
        "Authors from Microsoft Research. Code publicly available (GitHub). "
        "No novel estimation methods; contribution is the workflow and API design. "
        "Directly relevant to MediCore causal analysis and StreamRec causal evaluation. "
        "Proceed to second pass for API evaluation and refutation analysis."
    ),
)

print(f"Decision: {dowhy_pass1.decision}")
```
The first pass identifies this as a library paper, which shifts evaluation criteria from "does this method beat baselines?" to "does this library make the workflow more rigorous and less error-prone?" No red flags for the genre.
Pass 2: The Comprehension Pass (35 minutes)
The Four-Step Workflow
Step 1: Model. Users specify a causal graph using dowhy.CausalModel(data, treatment, outcome, graph). The graph encodes the causal assumptions — which variables cause which, which are confounders, which are mediators. The library then validates the graph against the data (checking that specified variables exist and that the graph is a valid DAG).
This is the paper's most important design decision: requiring users to specify a causal graph. Most causal inference libraries (at the time of publication) allowed users to jump directly to estimation — regress outcome on treatment with some covariates — without articulating the assumptions that justify the estimate. DoWhy makes the assumptions explicit and auditable.
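The "valid DAG" check mentioned in Step 1 can be sketched in a few lines of standard-library Python. This is an illustration of the idea only, not DoWhy's implementation; the function name `is_dag` and the edge-list representation are invented for this sketch:

```python
from collections import defaultdict, deque

def is_dag(edges):
    """Return True if the directed graph given as (parent, child) pairs is acyclic.

    Uses Kahn's algorithm: repeatedly remove nodes with no incoming edges;
    the graph is a DAG iff every node can be removed this way.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(nodes)

# A well-formed causal graph is accepted; adding a feedback edge creates a
# cycle, which no causal DAG may contain, and is rejected.
graph = [("confounder", "treatment"), ("confounder", "outcome"),
         ("treatment", "outcome")]
assert is_dag(graph)
assert not is_dag(graph + [("outcome", "confounder")])
```

Rejecting cyclic graphs at modeling time is exactly the kind of guardrail that distinguishes a structured workflow from ad hoc estimation: the error surfaces before any effect is computed.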
Step 2: Identify. Given the graph, DoWhy automatically derives the estimand — the statistical quantity that, under the causal assumptions, equals the causal effect. It uses the backdoor criterion, front-door criterion, and instrumental variable criterion (all covered in Chapter 17 of this book). If no valid identification strategy exists under the specified graph, the library reports this — preventing estimation under unjustified assumptions.
Step 3: Estimate. Given the identified estimand, DoWhy estimates the causal effect using one of several methods: propensity score matching, inverse probability weighting, instrumental variables, regression discontinuity, or doubly robust estimation. It also integrates with EconML for heterogeneous effect estimation (CATE via causal forests, DML, etc.).
Step 4: Refute. This is the step that most practitioners skip when working without a structured library. DoWhy provides automated refutation tests:
- Placebo treatment: Replace the treatment with a random variable. If the estimated effect persists, the original estimate is suspect.
- Random common cause: Add a random variable as a confounder. If the estimate changes substantially, it is sensitive to confounding specification.
- Data subset refuter: Re-estimate on random subsets of the data. If the estimate varies widely, it is not robust.
- Unobserved common cause (sensitivity analysis): How strong would an unmeasured confounder need to be to reduce the estimated effect to zero? This implements the sensitivity analysis framework of Ding and VanderWeele (2016), which Chapter 17 discussed in the context of MediCore's clinical trial.
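The placebo-treatment idea can be reproduced without the library: permute the treatment column, re-estimate with the same estimator, and check that the effect collapses toward zero. Everything here (the names, the data, the naive difference-in-means estimator) is an illustrative sketch, not DoWhy's refuter:

```python
import random

def difference_in_means(treatment, outcome):
    """Naive ATE estimate: mean outcome of treated minus mean of untreated."""
    treated = [y for t, y in zip(treatment, outcome) if t == 1]
    control = [y for t, y in zip(treatment, outcome) if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def placebo_refute(treatment, outcome, n_placebos=200, seed=0):
    """Replace the treatment with random permutations and re-estimate.

    Returns the list of placebo estimates; if the original estimate lies far
    outside this distribution, the effect is less likely to be an artifact.
    """
    rng = random.Random(seed)
    estimates = []
    shuffled = list(treatment)
    for _ in range(n_placebos):
        rng.shuffle(shuffled)
        estimates.append(difference_in_means(shuffled, outcome))
    return estimates

# Toy data with a genuine effect of +1.0 on the outcome.
rng = random.Random(1)
treatment = [rng.randint(0, 1) for _ in range(500)]
outcome = [t * 1.0 + rng.gauss(0.0, 0.5) for t in treatment]

original = difference_in_means(treatment, outcome)
placebos = placebo_refute(treatment, outcome)
print(f"original estimate: {original:.2f}")                     # near 1.0
print(f"placebo mean:      {sum(placebos)/len(placebos):.2f}")  # near 0.0
```

The original estimate sits well outside the placebo distribution, which is the pattern a robust estimate should show under this test.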
Evaluation Criteria for a Library Paper
Since this is a library paper, we evaluate it on different criteria than a methods paper:
| Criterion | Assessment |
|---|---|
| API clarity | Strong. The four-step workflow maps naturally to the conceptual framework. |
| Assumption transparency | Excellent. The graph requirement forces explicit assumptions. |
| Estimation flexibility | Good. Multiple estimators available; integrates with EconML. |
| Refutation automation | Excellent. Automated placebo, sensitivity, and subset tests. |
| Documentation | Good at time of publication; substantially improved since. |
| Maintenance | Active. Maintained by Microsoft Research; regular releases. |
| Testing | Adequate. Unit tests present; could benefit from more edge case coverage. |
| Performance at scale | Not addressed. Paper does not discuss computational cost or scalability. |
Limitations Identified
- Scalability. The paper does not discuss computational performance for large datasets. For MediCore's clinical trial data (thousands of patients), this is not a concern. For StreamRec's user engagement data (millions of users), propensity score methods and sensitivity analysis may become computationally prohibitive without careful implementation.
- Graph specification burden. Requiring a causal graph is the paper's greatest strength and its greatest practical barrier. Practitioners who do not know the causal structure — which is most practitioners, most of the time — may specify incorrect graphs, leading to incorrect identification and biased estimates. The library does not help users discover the graph; it helps them use a graph they have already specified.
- No time-series support (at time of publication). Causal inference with temporal data — treatment effects that unfold over time, time-varying confounders, dynamic treatment regimes — was not supported. This limits applicability to the StreamRec temporal engagement analysis and to MediCore's longitudinal follow-up data.
- Refutation is necessary but not sufficient. The refutation tests provide evidence of robustness but cannot prove that the causal estimate is correct. A placebo treatment test that returns zero does not prove the original treatment effect is real — it proves the model is not detecting effects where none exist, which is a weaker claim.
Production Relevance Assessment
```python
@dataclass
class PaperToProductionAssessment:
    """Production relevance assessment."""
    paper_title: str
    paper_result: str = ""
    production_context: str = ""
    data_gap: str = ""
    scale_gap: str = ""
    infrastructure_gap: str = ""
    decision: str = "prototype_first"


dowhy_production = PaperToProductionAssessment(
    paper_title="DoWhy: An End-to-End Library for Causal Inference",
    paper_result=(
        "Structured causal inference workflow with mandatory assumption "
        "specification and automated robustness checks."
    ),
    production_context=(
        "MediCore: 12-site clinical trial, thousands of patients, "
        "federated data. StreamRec: causal evaluation of recommendation "
        "impact on engagement, millions of users."
    ),
    data_gap=(
        "MediCore: minimal gap — clinical trial data is relatively clean. "
        "StreamRec: moderate gap — observational user data has unmeasured "
        "confounders (user intent, external events) that complicate graph "
        "specification."
    ),
    scale_gap=(
        "MediCore: no gap — clinical datasets are small enough for all methods. "
        "StreamRec: significant gap — propensity score methods on millions "
        "of users require careful subsampling or approximate methods."
    ),
    infrastructure_gap=(
        "Both: DoWhy is a Python library that integrates with pandas. "
        "No special infrastructure required. Can be added to existing "
        "analysis pipelines with minimal engineering effort."
    ),
    decision="adopt",
)

print(f"Decision: {dowhy_production.decision}")
```
Comparative Reflection: Methods Paper vs. Library Paper
This case study illustrates how the three-pass strategy adapts to different paper genres:
| Dimension | "Attention Is All You Need" (Methods) | "DoWhy" (Library) |
|---|---|---|
| Primary evaluation | Does the method outperform baselines? | Does the library make the workflow more rigorous? |
| Baseline question | Are baselines fair and current? | Does the library improve on manual practice? |
| Ablation question | Which components contribute most? | Which workflow steps add most value? |
| Reproducibility question | Can results be reproduced? | Can the library be installed and used? |
| Production question | Does the method scale to production? | Does the library integrate with production pipelines? |
| Red flag patterns | Unfair baselines, leakage, cherry-picking | Poor API design, missing edge cases, unmaintained |
The DoWhy paper's most significant contribution is not a novel algorithm — it is a process design. By making the four-step workflow (model, identify, estimate, refute) the mandatory structure of a causal analysis, the library prevents the most common practitioner error: jumping directly from data to causal claims without articulating assumptions or testing robustness. This is a contribution to the practice of causal inference, not to the theory, and it should be evaluated accordingly.
For the MediCore anchor example, DoWhy directly operationalizes the causal estimation pipeline from Chapters 16-19: the causal DAG from Chapter 17 becomes the graph input, the IPW and doubly robust estimators from Chapter 18 become the estimation step, and the sensitivity analysis from Chapter 17 becomes an automated refutation test. The library does not change what you do — it changes how rigorously and reproducibly you do it.
For the StreamRec causal evaluation (Chapter 15), DoWhy is useful for the observational analysis but does not replace the A/B testing infrastructure from Chapter 33. The correct production workflow is: use DoWhy for offline causal estimation (with appropriate caveats about unmeasured confounders), then validate with a randomized experiment. The library strengthens the offline analysis; it does not substitute for experimental evidence.
The paper passes the three-pass evaluation with no red flags for its genre. The decision is to adopt — not because it introduces a new method, but because it enforces a disciplined workflow that reduces the risk of unjustified causal claims.