Chapter 27: Further Reading
Essential Sources
1. Maxime Beauchemin, "The Rise of the Data Engineer" (2017) and the Apache Airflow Documentation
Beauchemin created Airflow at Airbnb in 2014 and open-sourced it in 2015. His 2017 essay, published on Medium, articulates the philosophy behind the project: that data pipelines should be defined as code (Python DAGs) and versioned, tested, and monitored with the same rigor as application code. The essay also draws the distinction between the traditional ETL engineer and the emerging role of the "data engineer" — a role that has since become central to every ML platform team.
Reading guidance: The essay provides the intellectual context for Airflow's design decisions, but the primary reference is the Apache Airflow documentation itself (airflow.apache.org). Start with the "Concepts" section, which covers DAGs, operators, sensors, XComs, pools, and trigger rules. The "Best Practices" page is particularly valuable: it covers DAG file processing performance, warns against top-level code in DAG files, and recommends the TaskFlow API (introduced in Airflow 2.0), whose @task decorator produces cleaner Python-based DAGs and replaces explicit XCom calls with ordinary return values. For production deployment patterns, see the "Kubernetes Executor" documentation, which describes how Airflow can launch each task as an isolated Kubernetes pod — the recommended pattern for ML pipelines that require heterogeneous resources (GPU pods for training, CPU pods for data processing). For a critical perspective on Airflow's limitations, see "Airflow, We Need to Talk" by Robert Chang (Medium, 2021), which catalogs pain points around DAG parsing, testing, and the global namespace.
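The mechanism behind the TaskFlow API can be sketched without Airflow itself: passing one decorated function's return value into another implicitly records a dependency edge, so the DAG falls out of ordinary Python data flow. Everything below (MiniDag, TaskOutput) is a hypothetical, framework-free illustration of that idea, not Airflow's actual API.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TaskOutput:
    producer: str   # name of the task that produced this value
    value: Any

class MiniDag:
    """Toy stand-in for the dependency inference a TaskFlow-style API performs."""
    def __init__(self):
        self.edges: dict[str, set[str]] = {}   # task name -> upstream task names

    def task(self, fn):
        name = fn.__name__
        self.edges[name] = set()
        def wrapper(*args):
            # Any argument that came from another task becomes a DAG edge.
            for a in args:
                if isinstance(a, TaskOutput):
                    self.edges[name].add(a.producer)
            plain = [a.value if isinstance(a, TaskOutput) else a for a in args]
            return TaskOutput(name, fn(*plain))
        return wrapper

dag = MiniDag()

@dag.task
def extract():
    return [1, 2, 3]

@dag.task
def total(rows):
    return sum(rows)

result = total(extract())   # wiring tasks by data flow builds the edge set
```

The point of the sketch is that no explicit "set upstream" call is needed: the dependency graph is recovered from how values move between functions, which is what lets TaskFlow code read like plain Python.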
2. Nick Schrock, "Introducing Software-Defined Assets" (Dagster Blog, 2022) and the Dagster Documentation
Schrock, co-creator of GraphQL and founder of Dagster Labs (originally named Elementl), published a series of blog posts in 2021–2022 articulating the asset-centric paradigm that distinguishes Dagster from Airflow. The central argument: orchestrators should track data assets (tables, files, model artifacts), not tasks (units of computation). A task is a means to an end; the asset is the end itself. This inversion — from "what should I run?" to "what should exist?" — enables automatic lineage tracking, freshness policies, and partition-aware materialization.
Reading guidance: Start with "Software-Defined Assets: A New Paradigm for Data Orchestration" (dagster.io/blog), which explains the motivation with concrete examples. Then read the Dagster documentation's "Concepts" section, focusing on Assets, IO Managers, Partitions, and Resources. The "Testing" guide is essential for this chapter: it describes build_asset_context for unit testing assets and materialize_to_memory for integration testing. For the IO manager abstraction, see the "IO Managers" concept page, which explains how the same asset code can write to local filesystem (development), S3 (staging), and Delta Lake (production) by swapping a single resource configuration. For a production deployment reference, see Dagster Cloud's documentation on "Branch Deployments," which provides per-pull-request pipeline environments — a CI/CD pattern that Airflow does not natively support. For a balanced comparison of Airflow and Dagster, see Sandy Ryza's "Why Dagster?" talk (Data Council, 2023), which presents the trade-offs without dismissing Airflow's strengths.
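The "what should exist?" question can be made concrete with a small staleness check: an asset needs rematerialization if it has never been materialized, or if any transitive upstream is itself stale or was materialized more recently. This is a framework-free sketch of the idea under assumed names and a simple integer-timestamp encoding, not Dagster's actual freshness machinery.

```python
def stale_assets(deps, materialized_at):
    """deps: asset -> list of upstream assets.
    materialized_at: asset -> logical timestamp of last materialization."""
    def is_stale(asset):
        ts = materialized_at.get(asset)
        if ts is None:
            return True                      # never materialized: must exist first
        return any(
            is_stale(up) or materialized_at[up] > ts
            for up in deps.get(asset, [])
        )
    return {asset for asset in deps if is_stale(asset)}

# raw_events was refreshed (tick 3) after features (tick 2), so features is
# stale, and model is stale transitively even though it ran most recently.
deps = {"raw_events": [], "features": ["raw_events"], "model": ["features"]}
materialized_at = {"raw_events": 3, "features": 2, "model": 4}
to_refresh = stale_assets(deps, materialized_at)
```

Note that the orchestrator never asks "which tasks should run?" directly; the set of runs is derived from the gap between what exists and what should exist.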
3. Jeremiah Lowin, "Why Prefect?" (Prefect Blog, 2019) and the Prefect 2.0 Documentation
Lowin's founding essay for Prefect argues that orchestration frameworks should eliminate "negative engineering" — the defensive code for retries, logging, state management, and failure notification that typically constitutes 60% of a pipeline script. Prefect's approach is to make standard Python functions orchestrable with minimal decoration (@flow, @task), preserving the development experience of writing and testing regular Python code while adding production-grade scheduling, retries, and observability.
Reading guidance: The Prefect documentation (docs.prefect.io) is organized around "Getting Started" tutorials and "Concepts" reference pages. The "Tasks" concept page covers retries, caching (cache_key_fn + cache_expiration), tags, and concurrency control. The "Deployments" page explains how a flow is packaged for remote execution — specifying the work pool, schedule, and infrastructure (Docker, Kubernetes, serverless). The task.submit() pattern for concurrent execution is covered in the "Futures" section and is Prefect's primary mechanism for parallelism within a flow. For a comparative perspective, see Lowin's PyCon 2023 talk "Orchestration Without the Orchestrator," which positions Prefect against both Airflow and Dagster and explains the trade-offs of the decorator-based approach. For production patterns, the Prefect blog's series on "Marvin" (Prefect's AI framework) shows how orchestration patterns extend to LLM-based pipelines — a growing use case as organizations integrate LLM calls into their ML workflows.
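The "negative engineering" Lowin describes is easy to see in miniature: below is the kind of hand-rolled retry wrapper with exponential backoff that a declarative retry setting on a task makes unnecessary. The decorator is a plain-Python sketch for illustration, not Prefect's implementation, and its name and parameters are assumptions.

```python
import functools
import time

def retry(attempts=3, base_delay=0.0):
    """Re-run fn on exception, doubling the delay after each failure."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise               # out of attempts: surface the error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorate

calls = {"n": 0}

@retry(attempts=3)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"
```

Multiply this by logging, state persistence, and failure notification and the defensive code quickly dominates the pipeline's actual logic, which is exactly the boilerplate the decorator-based approach aims to absorb.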
4. D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (NeurIPS, 2015)
Referenced in Chapter 24 and essential background for this chapter. Sculley et al. identify pipeline jungles — tangled webs of data preparation scripts that grow organically and resist testing, monitoring, and modification — as one of the most common forms of ML technical debt. Pipeline orchestration frameworks are the primary engineering response to pipeline jungles: by formalizing pipelines as DAGs with explicit dependencies, typed interfaces, and automated testing, they transform ad hoc scripts into maintainable software systems.
Reading guidance: Section 3 (Pipeline Jungles) describes the problem that orchestration frameworks solve. Section 4 (Configuration Debt) is directly relevant to pipeline versioning — the challenge of managing the hyperparameters, feature lists, and thresholds that configure each pipeline run. Section 6 (Dealing with Changes in the External World) connects to the backfill and monitoring concerns of this chapter: when the external world changes (upstream schema migration, distribution shift, regulatory requirement), the pipeline must adapt without introducing technical debt. For a quantitative follow-up, see "An Empirical Study of Technical Debt in Machine Learning Systems" (Bogner et al., MSR, 2021), which surveys industry practitioners on the prevalence and impact of each debt category identified by Sculley et al.
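One lightweight response to configuration debt is to treat the full run configuration as data and derive a stable fingerprint from it, so every pipeline run is identified by exactly the hyperparameters, feature lists, and thresholds that produced it. A minimal sketch, assuming a JSON-serializable config dict; the function name and hash truncation are illustrative choices, not a standard.

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Short, stable hash of a run configuration (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

a = config_fingerprint({"lr": 0.01, "features": ["age", "income"]})
b = config_fingerprint({"features": ["age", "income"], "lr": 0.01})  # same config
c = config_fingerprint({"lr": 0.02, "features": ["age", "income"]})  # changed value
```

Storing this fingerprint alongside each run's outputs makes it cheap to answer "which configuration produced this artifact?" when a threshold or feature list later changes.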
5. Martin Kleppmann, Designing Data-Intensive Applications, Chapter 10: Batch Processing (O'Reilly, 2017)
Kleppmann's treatment of batch processing provides the theoretical foundation for pipeline orchestration. Chapter 10 covers the MapReduce computational model, the distinction between bounded and unbounded datasets, the concept of materialized intermediate state (precisely the role that pipeline assets play), and the principle that batch jobs should be deterministic and idempotent — producing the same output given the same input, which enables safe retries and reprocessing. The chapter's discussion of the "philosophy of batch process outputs" — that outputs should be treated as derived data that can always be re-derived from inputs — directly informs the idempotency and backfill strategies in this chapter.
Reading guidance: Section "Materialization of Intermediate State" explains why pipeline orchestrators store intermediate results (validated data, training features) rather than recomputing them on every run — and under what conditions recomputation is preferable to caching. Section "Joins" covers the temporal join pattern (point-in-time correct feature lookups) that the feature computation task implements. The chapter concludes with a discussion of "Beyond MapReduce" that covers dataflow engines (Spark, Flink) — the execution engines that pipeline orchestrators schedule. For readers who want to understand the streaming counterpart (relevant for real-time feature pipelines), Chapter 11 covers stream processing and the relationship between batch and stream processing in the lambda and kappa architectures.
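Kleppmann's deterministic-and-idempotent discipline can be sketched concretely: a batch job writes its output to a path derived from the input partition and overwrites on retry, so rerunning or backfilling a partition converges to the same state instead of accumulating duplicates. The directory layout and names below are illustrative assumptions, not a prescribed convention.

```python
import json
import tempfile
from pathlib import Path

def run_partition(partition: str, rows: list, out_dir: Path) -> Path:
    """Deterministic, idempotent batch step: same inputs -> same file contents."""
    out = out_dir / f"date={partition}" / "part-0.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    result = {"partition": partition, "total": sum(rows)}  # pure function of inputs
    out.write_text(json.dumps(result, sort_keys=True))     # overwrite, never append
    return out

out_dir = Path(tempfile.mkdtemp())
first = run_partition("2024-01-01", [1, 2, 3], out_dir).read_text()
second = run_partition("2024-01-01", [1, 2, 3], out_dir).read_text()  # safe retry
```

Because the output is keyed by partition and fully recomputable from inputs, a backfill is just a loop over partitions, and a failed run can simply be retried without cleanup.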