Orchestration Tool: Dagster

Learn about Dagster, an Airflow alternative.


When Airflow was created, it was a groundbreaking technology and quickly became the pioneer in the field of data orchestration. Several years later, however, its limitations became increasingly evident, and data teams began encountering a range of difficulties. For example, Airflow pipelines are hard to develop and test outside of a production deployment, and data is not a first-class concept in Airflow, making it challenging to link the data assets we care about to the tasks that produce them.

Dagster's founder saw these challenges as an opportunity and created the project in 2018 with a team of engineers at Elementl, a software company dedicated to building tools that help teams work better with data. This lesson provides an overview of Dagster's architecture and highlights its most essential design principles, demonstrated through live examples. This lesson uses Dagster version 1.4.7.

Architecture

Data pipelines in Airflow are typically bound to a specific environment. For example, all DAGs in a deployment must use the same version of Python and the same set of packages. If a task requires an isolated environment, we can use operators like PythonVirtualenvOperator or KubernetesPodOperator. However, a KubernetesPodOperator task cannot be run or tested locally or as part of CI.
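To see what per-task isolation involves, here is a minimal, stdlib-only sketch of the idea behind PythonVirtualenvOperator: build a throwaway virtual environment and run the task's code in it as a subprocess. The function `run_in_virtualenv` and its signature are illustrative, not Airflow's actual API.

```python
# Sketch of virtualenv-based task isolation, assuming only the Python
# standard library. `run_in_virtualenv` is a hypothetical helper, not
# part of Airflow.
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def run_in_virtualenv(task_source: str, requirements: list[str]) -> str:
    """Execute task_source inside a fresh venv and return its stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        venv_dir = Path(tmp) / "venv"
        # pip is only needed if there are requirements to install
        venv.create(venv_dir, with_pip=bool(requirements))
        bin_dir = "Scripts" if sys.platform == "win32" else "bin"
        python = venv_dir / bin_dir / "python"
        if requirements:
            subprocess.run(
                [str(python), "-m", "pip", "install", *requirements],
                check=True,
            )
        script = Path(tmp) / "task.py"
        script.write_text(task_source)
        result = subprocess.run(
            [str(python), str(script)],
            capture_output=True, text=True, check=True,
        )
        return result.stdout


if __name__ == "__main__":
    print(run_in_virtualenv("print('hello from an isolated env')", []))
```

Note that this trick still runs on the local machine, so it remains testable; KubernetesPodOperator has no such local fallback, which is exactly the pain point described above.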

One of the key innovations in Dagster's architecture is the ability to decouple a pipeline from any particular environment. Dagster uses an architecture similar to Airflow's, but user code runs in separate gRPC servers ("user code deployments") that the scheduling tier communicates with over the network. Each deployment can be spawned as an isolated Docker container, which solves dependency hell between data pipelines as well as conflicts between data pipelines and the scheduler. This means we can upgrade the scheduler without upgrading the user code, and vice versa.
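As a concrete illustration, Dagster's `workspace.yaml` file tells the webserver and daemon where the user code servers live. The host names, ports, and location names below are made-up examples:

```yaml
# workspace.yaml — points Dagster at gRPC user code servers.
# Hosts, ports, and location names are hypothetical examples.
load_from:
  - grpc_server:
      host: team-a-user-code   # container running a Dagster code server
      port: 4000
      location_name: "team_a_pipelines"
  - grpc_server:
      host: team-b-user-code   # a second, independently deployed image
      port: 4000
      location_name: "team_b_pipelines"
```

Each code server image can pin its own Python version and dependencies, so redeploying one team's pipelines never touches the scheduler or the other team's code.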
