Chapter 27: Quiz

Test your understanding of ML pipeline orchestration. Answers follow each question.


Question 1

What is a directed acyclic graph (DAG), and why is the "acyclic" constraint essential for pipeline orchestration?

Answer

A **directed acyclic graph (DAG)** is a graph with directed edges and no cycles — there is no sequence of edges that leads from a node back to itself. The acyclic constraint is essential because it guarantees a valid **topological ordering**: an execution sequence where every task runs only after all of its dependencies have completed. If cycles were allowed, there would be no valid execution order (task A waits for B, B waits for A — deadlock). The acyclic property enables the orchestrator to compute the correct execution order, identify parallelism opportunities, and propagate failure states through the dependency chain.
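The topological ordering described above can be computed with Kahn's algorithm. A minimal sketch — the task names and dependency edges are illustrative, not the chapter's actual pipeline:

```python
from collections import deque

def topological_order(deps: dict) -> list:
    """Return an execution order where each task follows its dependencies.

    deps maps task -> set of upstream tasks it waits on.
    Raises ValueError if the graph contains a cycle (no valid order).
    """
    indegree = {task: len(ups) for task, ups in deps.items()}
    downstream = {task: [] for task in deps}
    for task, ups in deps.items():
        for up in ups:
            downstream[up].append(task)

    ready = deque(task for task, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in downstream[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected: no valid execution order")
    return order

# Illustrative DAG: extract -> validate -> two parallel trainings -> evaluate
pipeline = {
    "extract": set(),
    "validate": {"extract"},
    "train_retrieval": {"validate"},
    "train_ranking": {"validate"},
    "evaluate": {"train_retrieval", "train_ranking"},
}
print(topological_order(pipeline))
```

Note that `train_retrieval` and `train_ranking` both become ready as soon as `validate` finishes — exactly the parallelism opportunity the answer mentions.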

Question 2

What is the difference between a task's logical date (data interval) and its execution date? Why does this distinction matter for idempotency?

Answer

The **logical date** (data interval) is the time period that the pipeline processes — e.g., "March 14" for a daily pipeline. The **execution date** is when the pipeline actually runs — e.g., "March 15 at 2:00am" for that same daily run. This distinction matters for **idempotency** because a retry or backfill must process the same data interval (March 14) regardless of when it runs. If the pipeline used wall-clock time instead of the logical date, a retry at 5am would process different data than the original run at 2am, violating idempotency. Airflow calls this the "data interval," Dagster calls it the "partition key," and Prefect handles it via flow parameters.
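One concrete consequence: output paths should be derived from the logical date, never from wall-clock time. A minimal sketch — the path layout is illustrative:

```python
from datetime import date

def partition_path(base: str, logical_date: date) -> str:
    """Deterministic output path for a daily data interval.

    A retry at any wall-clock time still targets the same partition,
    so re-runs overwrite rather than duplicate.
    """
    return f"{base}/{logical_date.isoformat()}/output.parquet"

# The 2am run on March 15 and a 5am retry both write to the same path:
print(partition_path("/data/features", date(2025, 3, 14)))
# /data/features/2025-03-14/output.parquet
```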

Question 3

Describe the fundamental philosophical difference between Airflow's task-centric model and Dagster's asset-centric model.

Answer

**Airflow** is **imperative and task-centric**: you define tasks (units of computation) and explicitly specify their execution order. The orchestrator's job is to run those tasks in the declared sequence. Airflow answers the question "What tasks should I run?" **Dagster** is **declarative and asset-centric**: you define data assets (persistent outputs like tables, files, model artifacts) and the functions that produce them. The DAG is inferred automatically from asset dependencies — if asset B's function takes asset A as input, Dagster knows A must be materialized before B. Dagster answers the question "What data should exist?" The practical consequence is that Dagster has built-in data lineage (it knows what data flows between stages), while Airflow only knows about task-to-task dependencies (it does not inherently track the data that flows through XCom or external storage).

Question 4

What is an Airflow Sensor, and when would you use one instead of a hard-coded schedule dependency?

Answer

An Airflow **Sensor** is a special operator that repeatedly checks (pokes) for an external condition — file existence, partition availability, API response, another DAG's completion — and blocks the downstream task until the condition is met. For example, `ExternalTaskSensor` waits for a task in another DAG to complete. Use a sensor instead of a hard-coded schedule dependency when the upstream event has **variable timing**. If pipeline A "usually finishes by 2:30am" and pipeline B is scheduled for 3:00am, a 15-minute delay in A causes B to process incomplete data. A sensor eliminates this race condition by explicitly waiting for A's completion, regardless of when it occurs. This is the anti-pattern the chapter calls "the schedule waterfall."
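The poke behavior can be sketched framework-free. A simplified poll loop — the real `ExternalTaskSensor` adds poke/reschedule modes, logging, and scheduler integration:

```python
import time

def wait_for(condition, poke_interval: float = 60.0, timeout: float = 3600.0) -> bool:
    """Block until condition() is True, re-checking every poke_interval seconds.

    Returns True on success; raises TimeoutError if the deadline passes,
    mirroring a sensor that fails after its timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poke_interval)
    raise TimeoutError("upstream condition never became true")

# e.g. wait for an upstream success marker instead of guessing a start time:
# wait_for(lambda: os.path.exists("/data/2025-03-14/_SUCCESS"))
```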

Question 5

Explain Airflow's XCom mechanism. What is its primary limitation, and how does Dagster's IO Manager address that limitation?

Answer

**XCom** (Cross-Communication) is Airflow's mechanism for passing small data between tasks. A task pushes a value to XCom (typically by returning from a `PythonOperator` callable), and downstream tasks pull it. The primary limitation is **size**: the default XCom backend stores data in the Airflow metadata database with a practical limit of ~48KB. Large datasets (training DataFrames, model artifacts) must be stored externally (S3, GCS), with only a reference (path) passed through XCom. Dagster's **IO Manager** addresses this by decoupling computation from storage. An IO Manager is a resource that handles loading and storing assets — the `S3ParquetIOManager` in the chapter writes DataFrames to S3 and reads them back transparently. The asset function simply returns a DataFrame; the IO Manager handles persistence. This means storage format and location can be changed by swapping the IO Manager resource (local filesystem for dev, S3 for production) without modifying asset code.
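The decoupling the IO Manager provides can be sketched in plain Python. This illustrates the idea only — it is not Dagster's actual API, and the class and key names are invented:

```python
import pickle
import tempfile
from pathlib import Path

class LocalPickleIOManager:
    """Persists task outputs so task functions can return plain values.

    Swapping this for an S3/Parquet-backed manager would change storage
    without touching the task functions themselves.
    """
    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def store(self, key: str, value) -> None:
        (self.root / f"{key}.pkl").write_bytes(pickle.dumps(value))

    def load(self, key: str):
        return pickle.loads((self.root / f"{key}.pkl").read_bytes())

def features_task():
    # Task code just returns data; persistence is the manager's job.
    return {"user_id": [1, 2], "watch_time": [3.5, 1.2]}

io = LocalPickleIOManager(tempfile.mkdtemp())
io.store("features", features_task())
downstream_input = io.load("features")   # what the next task would receive
```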

Question 6

What does it mean for a pipeline task to be idempotent, and why is idempotency critical for retries and backfills?

Answer

A task is **idempotent** if executing it multiple times with the same input produces the same output and the same side effects. For example, writing a Parquet file to a partition-specific path is idempotent (each run overwrites the previous output with identical content), while appending rows to a database table is not (each run adds duplicate rows). Idempotency is critical for **retries** because a failed task may have partially completed (e.g., written half the output). A retry must produce the correct complete output, not build on the partial result. Idempotency is critical for **backfills** because the same logical date may be processed multiple times (initial run, re-run after a bug fix). Without idempotency, re-processing accumulates duplicate data, corrupted artifacts, or inconsistent state.

Question 7

Name three patterns for achieving idempotency in pipeline tasks.

Answer

1. **Write-then-rename (atomic writes):** Write output to a temporary location, then atomically rename to the final path. If the task fails before the rename, the temporary file is cleaned up; if it succeeds, the output is complete. On systems without atomic rename (like S3), use copy-then-delete.
2. **Partition-based overwrites:** Write output to a deterministic, partition-specific path (e.g., `/data/2025-03-14/output.parquet`) and overwrite the entire partition on each run. Re-runs replace partial outputs with complete ones.
3. **Deterministic artifact paths:** Derive artifact paths from the logical date and configuration hash — not from timestamps, random UUIDs, or execution-time values. This ensures that re-running with the same inputs writes to the same location, overwriting rather than duplicating.
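The first pattern in a minimal sketch, using `os.replace` (atomic when source and destination are on the same POSIX filesystem):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, then atomically rename.

    Readers never observe a partial file: either the old content or the
    complete new content exists at `path`. A crash before os.replace
    leaves only a stray temp file, not a corrupt output.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # durable before the rename
        os.replace(tmp_path, path)        # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Writing the temp file into the destination directory (not `/tmp`) matters: `os.replace` is only atomic within one filesystem.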

Question 8

What is exponential backoff, and what is its mathematical formula? Why is jitter added?

Answer

**Exponential backoff** is a retry strategy where the delay between retries doubles with each attempt. The formula for the delay before attempt $k$ is: $$d_k = \min\left(d_0 \cdot 2^{k-1},\; d_{\max}\right)$$ where $d_0$ is the base delay and $d_{\max}$ is the maximum delay cap. **Jitter** (random noise added to the delay) prevents the **thundering herd problem**: if multiple tasks fail simultaneously (e.g., due to a shared dependency outage), without jitter they would all retry at exactly the same time, potentially overloading the recovered dependency. Jitter spreads the retries over a time window, reducing the peak load on the recovering system.
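The formula as code, using "full jitter" (delay drawn uniformly from $[0, d_k]$) — one common jitter scheme, chosen here for illustration:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=None) -> float:
    """Jittered delay before retry `attempt` (1-indexed).

    The deterministic bound is min(base * 2**(attempt-1), cap);
    full jitter samples uniformly from [0, bound].
    """
    rng = rng or random.Random()
    bound = min(base * 2 ** (attempt - 1), cap)
    return rng.uniform(0, bound)

# Bounds without jitter: 1, 2, 4, 8, 16, 32, 60, 60, ... (capped)
```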

Question 9

What is a dead letter queue in the context of data pipelines? When should a DLQ be used instead of failing the entire task?

Answer

A **dead letter queue (DLQ)** captures individual records that fail validation or processing, allowing the pipeline to continue with the remaining valid records. Failed records are stored for later investigation and potential reprocessing. A DLQ should be used when: (1) a small fraction of records are expected to fail (e.g., 0.1% of events have malformed fields), (2) the overall batch is still useful without the failed records, and (3) the failed records should be investigated but should not block the pipeline. The DLQ class in the chapter includes a `max_failure_rate` threshold — if the failure rate exceeds this threshold (e.g., >1%), the pipeline halts, because a high failure rate signals a systemic problem rather than isolated bad records.
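A minimal DLQ sketch with the failure-rate circuit breaker described above — the class name and threshold mirror the chapter's description, but the implementation details here are illustrative:

```python
class DeadLetterQueue:
    """Collects failed records; halts the batch if failures exceed a threshold."""

    def __init__(self, max_failure_rate: float = 0.01):
        self.max_failure_rate = max_failure_rate
        self.failed = []   # list of (record, reason) for later investigation

    def process(self, records, handler):
        """Apply handler to each record, routing exceptions to the DLQ.

        Raises RuntimeError if the failure rate exceeds max_failure_rate,
        signalling a systemic problem rather than isolated bad records.
        """
        ok = []
        for record in records:
            try:
                ok.append(handler(record))
            except Exception as exc:
                self.failed.append((record, str(exc)))
        rate = len(self.failed) / max(len(records), 1)
        if rate > self.max_failure_rate:
            raise RuntimeError(
                f"failure rate {rate:.1%} exceeds {self.max_failure_rate:.1%}"
            )
        return ok
```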

Question 10

Describe the four alert severity levels defined in the chapter, and specify the appropriate notification channel for each.

Answer

| Severity | Description | Example | Notification Channel |
|----------|-------------|---------|----------------------|
| **P0 — Critical** | Pipeline halted after all retries exhausted | All 3 retry attempts for training failed | PagerDuty (page oncall) |
| **P1 — High** | Pipeline skipped a day's training | Data validation failed, no model trained today | Slack #ml-alerts + oncall notification |
| **P2 — Medium** | Pipeline completed but with degraded quality | Model trained but NDCG@20 dropped; model not promoted | Slack #ml-alerts (investigate within 24h) |
| **P3 — Low** | Transient issue, self-resolved | Task failed on first attempt but succeeded on retry | Logged only (weekly review) |

The key principle is that not all failures deserve the same urgency. Over-alerting (paging for P3 events) causes alert fatigue; under-alerting (logging P0 events) causes outages.

Question 11

What is backfill, and what are the three backfill strategies described in the chapter?

Answer

**Backfill** is the process of running a pipeline for historical data intervals that were missed (due to downtime) or need reprocessing (due to bug fixes, schema changes, or new model architectures). The three strategies are:

1. **Sequential:** Process each missing interval one at a time, in chronological order. Safe but slow. Required when tasks have temporal dependencies (today's rolling-window features depend on yesterday's outputs).
2. **Parallel:** Process multiple intervals simultaneously. Fast but resource-intensive. Appropriate when tasks are truly independent across partitions.
3. **Prioritized:** Process the most recent intervals first, then work backward. Ensures the current model is trained on the freshest data while historical gaps are filled asynchronously.
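The prioritized strategy in a minimal sketch: find the missing daily intervals in a range and order them most-recent-first. Function name and dates are illustrative:

```python
from datetime import date, timedelta

def prioritized_backfill(start: date, end: date, completed: set) -> list:
    """Missing daily partitions in [start, end], most recent first."""
    days = (end - start).days + 1
    all_dates = (start + timedelta(days=d) for d in range(days))
    return sorted((d for d in all_dates if d not in completed), reverse=True)

# March 10 and 12 already ran; backfill the rest, newest first.
done = {date(2025, 3, 10), date(2025, 3, 12)}
queue = prioritized_backfill(date(2025, 3, 8), date(2025, 3, 14), done)
```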

Question 12

Why is backfilling a pipeline with rolling-window features (e.g., 7-day moving averages) more complex than backfilling a pipeline with point-in-time features?

Answer

Rolling-window features create **cross-partition dependencies**: the features for date $t$ depend on data from dates $t-6$ through $t$. When backfilling dates March 8–14, the features for March 12 depend on data from March 6–12, a window that includes March 8–11 — dates that are themselves being backfilled. Parallel backfill would compute March 12's features using stale or missing data from March 8–11. The solution is to process backfill dates in **chronological order** with limited parallelism, ensuring that each date's rolling-window dependencies are fully materialized before it is processed. Only dates that are at least $w$ days apart (where $w$ is the window size) can be safely parallelized.
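The ordering constraint can be expressed as an eligibility check: a date's features may be computed only once every date in its lookback window has been materialized. A sketch — the function name and the way the window is handled are illustrative:

```python
from datetime import date, timedelta

def ready_to_backfill(target: date, materialized: set, window: int = 7) -> bool:
    """True if every date in target's lookback window
    (target - window + 1 .. target - 1) has been materialized."""
    return all(
        target - timedelta(days=k) in materialized
        for k in range(1, window)
    )

# March 5-11 are materialized; March 12 is eligible, March 13 is not yet.
done = {date(2025, 3, d) for d in range(5, 12)}
```

A backfill driver would loop over the pending dates in chronological order, dispatching each one as soon as this check passes.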

Question 13

What are the three levels of pipeline testing described in the chapter?

Answer

1. **Unit tests:** Test individual task logic in isolation, without the orchestrator runtime. Tasks are called as pure functions with synthetic inputs. Example: testing that `validate_data_logic` raises `ValueError` when the null rate exceeds 5%.
2. **Integration tests:** Test the full pipeline end-to-end on synthetic data, verifying that data flows correctly through all stages. External dependencies (Feast, MLflow) are mocked or replaced with test containers.
3. **Contract tests:** Verify that the output of one task matches the expected input schema of the next task. A `DataContract` specifies required columns, types, null constraints, and minimum row counts. Contract tests catch schema evolution bugs — when one team changes a feature computation, the contract test fails if the downstream consumer does not expect the new schema.

Question 14

What is a data contract between pipeline stages, and how does it differ from a database schema constraint?

Answer

A **data contract** is a specification of the data that one pipeline stage promises to produce and the next stage expects to receive. It includes column names, data types, null constraints, minimum row counts, and value-level constraints (e.g., "probability must be between 0 and 1"). It differs from a database schema constraint in several ways:

1. It operates on DataFrames or files, not database tables.
2. It includes statistical constraints (minimum row count, null rate thresholds) beyond type constraints.
3. It is enforced at pipeline execution time, not at write time.
4. It is defined in code alongside the pipeline, not in the database DDL.
5. Violations can trigger pipeline-level responses (halt, alert, dead-letter) rather than simple write rejections.
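A minimal contract-validation sketch in plain Python. The chapter's `DataContract` is richer; the field names here, and the dict-of-columns stand-in for a DataFrame, are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    required_columns: dict        # column name -> expected Python type
    min_rows: int = 1
    non_null: set = field(default_factory=set)

    def validate(self, table: dict) -> None:
        """Raise ValueError on the first contract violation.

        `table` is a mapping of column name -> list of values,
        standing in for a DataFrame.
        """
        for col, typ in self.required_columns.items():
            if col not in table:
                raise ValueError(f"missing column: {col}")
            if not all(v is None or isinstance(v, typ) for v in table[col]):
                raise ValueError(f"wrong type in column: {col}")
        n_rows = len(next(iter(table.values()), []))
        if n_rows < self.min_rows:
            raise ValueError(f"too few rows: {n_rows} < {self.min_rows}")
        for col in self.non_null:
            if any(v is None for v in table.get(col, [])):
                raise ValueError(f"nulls in non-null column: {col}")
```

Running `contract.validate(output)` at the end of one task (and on the input of the next) is what turns schema drift into a loud pipeline failure instead of a silent downstream bug.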

Question 15

Explain the "Monolithic Task" anti-pattern and its consequences.

Answer

The **Monolithic Task** anti-pattern is a single task that performs all pipeline stages — extraction, validation, transformation, training, evaluation, and registration — in one function. When it fails at minute 90 of a 120-minute run, the entire 90 minutes of work must be repeated from scratch, because there are no intermediate checkpoints. Consequences: (1) wasted compute on retry (re-doing completed work); (2) difficult debugging (the failure could be in any stage); (3) no parallelism opportunities (the retrieval and ranking models cannot train concurrently); (4) no partial backfill (you cannot re-run just the feature computation without re-running extraction and validation). **Fix:** Decompose into granular, idempotent tasks with checkpointed intermediate outputs stored in a persistent location (S3, feature store). Each task can be retried, tested, and backfilled independently.

Question 16

What is the "Schedule Waterfall" anti-pattern, and what is the recommended alternative?

Answer

The **Schedule Waterfall** anti-pattern occurs when pipeline B is scheduled at a fixed time (e.g., 3:00am) based on the assumption that pipeline A "usually finishes by 2:30am." When pipeline A runs 15 minutes late — due to data volume spikes, infrastructure issues, or upstream delays — pipeline B starts before A has finished, processing incomplete data. The result is silently degraded outputs. The recommended alternative is **event-driven triggering** using sensors or callbacks: pipeline B starts when pipeline A signals completion, regardless of wall-clock time. In Airflow, this is an `ExternalTaskSensor`. In Dagster, asset dependencies handle this automatically — Dagster will not materialize a downstream asset until its upstream assets are fresh. In Prefect, an automation or event trigger can start the downstream flow when the upstream flow completes.

Question 17

The chapter discusses three orchestration frameworks. In what type of environment does each framework have the strongest advantage?

Answer

**Airflow** has the strongest advantage in **diverse-workload environments** — organizations running ETL, ML, analytics, dbt, and data engineering pipelines on a shared platform. Its 80+ provider packages, 10+ years of production track record, and massive community mean that operators exist for almost every external system. Teams with existing Airflow expertise and a mix of workload types will find Airflow's ecosystem hard to match.

**Dagster** has the strongest advantage in **data-and-ML-intensive environments** where asset lineage, freshness tracking, and testability are priorities. Teams building multiple ML pipelines that share data assets benefit from Dagster's first-class data lineage, partition management, and IO manager abstraction. The asset-centric model is particularly powerful when the question is "which assets are stale?" rather than "which tasks need to run?"

**Prefect** has the strongest advantage in **small-to-medium teams** that want to orchestrate existing Python code with minimal framework adoption cost. The decorator-based API (`@flow`, `@task`) requires no new abstractions beyond standard Python, and flows can be developed and tested by running a single Python file. Teams that value rapid iteration and minimal boilerplate over enterprise features will find Prefect's approach appealing.

Question 18

What is the role of Airflow Pools, and how would you use them in an ML pipeline?

Answer

**Airflow Pools** are resource constraints that limit the number of concurrent task instances that can run in a particular resource category. Each pool has a configurable number of slots, and a task instance must acquire a slot from its assigned pool before executing. In an ML pipeline, pools serve two purposes:

1. **GPU resource management:** A `gpu_pool` with 4 slots (matching 4 available GPUs) ensures that no more than 4 GPU training tasks run simultaneously across all DAGs. Without the pool, multiple concurrent pipelines could over-subscribe the GPU cluster.
2. **External system protection:** A `feature_store_pool` with a limited number of slots prevents too many concurrent feature store queries from overwhelming the Feast online store. Similarly, a `model_registry_pool` limits concurrent MLflow API calls.

The StreamRec pipeline uses `pool="gpu_pool"` on the training tasks and `pool="feature_store_pool"` on the feature computation task.
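The slot-acquisition semantics can be modeled with a bounded semaphore — a conceptual model of the behavior, not Airflow's implementation:

```python
import threading

class Pool:
    """A fixed number of slots; a task must hold a slot while it runs."""

    def __init__(self, name: str, slots: int):
        self.name = name
        self._sem = threading.BoundedSemaphore(slots)

    def __enter__(self):
        self._sem.acquire()   # blocks when all slots are taken
        return self

    def __exit__(self, *exc):
        self._sem.release()

gpu_pool = Pool("gpu_pool", slots=4)

def train(model_name: str):
    with gpu_pool:            # at most 4 concurrent training tasks
        ...                   # launch GPU training here
```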

Question 19

Explain how Dagster's FreshnessPolicy provides a more flexible alternative to cron-based scheduling.

Answer

A **FreshnessPolicy** declares that an asset should be no more than a specified duration old, rather than requiring materialization at a specific time. For example, `FreshnessPolicy(maximum_lag_minutes=26*60)` means the retrieval model should have been materialized within the last 26 hours. This is more flexible than cron scheduling because:

1. **Tolerance for delays.** If the 2am pipeline finishes at 4am instead of 3am, the asset is still within the 26-hour window. No false SLA breach is triggered.
2. **Variable scheduling.** If the upstream data arrives at different times on different days, the freshness policy allows the pipeline to run when data is available, as long as the asset stays within the freshness window.
3. **Composition.** Freshness policies compose across the asset graph: if the training features asset has a 24-hour policy and the model asset has a 26-hour policy, Dagster schedules both to meet their respective constraints, accounting for the dependency between them.
4. **Alerting.** When an asset violates its freshness policy, Dagster can alert without requiring a separate SLA monitoring system.
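The core check behind a freshness policy is simple to sketch. This mirrors the idea, not Dagster's internals:

```python
from datetime import datetime, timedelta

def is_fresh(last_materialized: datetime, maximum_lag_minutes: int,
             now=None) -> bool:
    """An asset satisfies its freshness policy while its age stays
    within the declared maximum lag."""
    now = now or datetime.utcnow()
    return now - last_materialized <= timedelta(minutes=maximum_lag_minutes)

# 26-hour policy: a run that finished at 4am instead of 3am is still fresh,
# so no false SLA breach fires.
```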

Question 20

The chapter states that "a model in production should be traceable back to the exact pipeline code, configuration, training data, and feature computation logic that produced it." What is the name for this requirement, and what components of the artifact lineage graph support it?

Answer

This requirement is called **reproducibility** (or more specifically, **provenance tracking** / **artifact lineage**). It ensures that given a model in production, an engineer can reconstruct exactly how it was produced and, if necessary, reproduce it. The components of the artifact lineage graph that support this are:

1. **Pipeline code version** — git SHA pinning the exact code that ran
2. **Configuration version** — hyperparameters, feature lists, and thresholds used
3. **Data versions** — Delta Lake table versions or partition identifiers for each input data source
4. **Feature store version** — Feast feature set version and the specific features requested
5. **MLflow run IDs** — linking to the experiment tracker's record of parameters, metrics, and artifacts for each model
6. **Model registry versions** — the registered model version and its stage (Staging, Production)
7. **Environment specification** — Python version, package versions, CUDA version, hardware configuration

The `PipelineRunMetadata` dataclass in Section 27.9 captures these components in a structured record that is stored alongside the model artifact.
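The chapter's `PipelineRunMetadata` captures these components in a structured record; a sketch of what such a record might look like — the field names and example values here are illustrative, not the chapter's exact definition:

```python
from dataclasses import dataclass, asdict, field

@dataclass(frozen=True)
class PipelineRunMetadata:
    """Lineage record stored alongside the model artifact."""
    git_sha: str                 # pipeline code version
    config_hash: str             # hyperparameters / feature-list hash
    data_versions: dict          # source -> Delta version or partition id
    feature_set_version: str     # Feast feature set version
    mlflow_run_id: str           # experiment tracker record
    model_version: str           # model registry version
    environment: dict = field(default_factory=dict)

record = PipelineRunMetadata(
    git_sha="a1b2c3d",
    config_hash="9f8e7d",
    data_versions={"events": "delta-v42", "catalog": "2025-03-14"},
    feature_set_version="user_features:v7",
    mlflow_run_id="run-123",
    model_version="ranking-v15",
    environment={"python": "3.11", "cuda": "12.1"},
)
lineage = asdict(record)   # serialize and store next to the artifact
```

The `frozen=True` flag makes the record immutable once written, which is the right default for a provenance record.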