Airflow Basics

Learn the basic concepts of Airflow, such as its architecture, operators, and sensors.

When it comes to data orchestration tools, Airflow is a must-learn. Airflow is a widely recognized open-source workflow management tool for data engineering pipelines, and it has become the go-to choice for many data teams to streamline their data orchestration processes. Throughout this lesson, we will explore the fundamental concepts of Airflow with the help of examples.

History

Airflow was started at Airbnb in 2014 as a solution to manage the company's increasingly complex data workflows. In 2015, Airbnb open-sourced Airflow, and it quickly gained popularity among data engineering teams looking for a reliable workflow management solution. Airflow entered the Apache Software Foundation's Incubator in 2016 and became a top-level Apache project in 2019, with a stable codebase and a strong community.

Architecture

At its core, Airflow defines workflows in the form of DAGs (Directed Acyclic Graphs). Each DAG is composed of Tasks, each representing a single unit of work to be performed. The DAG also declares the dependencies between tasks to ensure they are executed in the correct order.
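To make this concrete, here is a minimal sketch of a DAG definition, assuming Airflow 2.x; the dag_id, schedule, and bash commands are purely illustrative placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal DAG with three tasks: extract -> transform -> load.
# The dag_id, schedule, and commands below are placeholders for illustration.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # run once per day (use schedule_interval on Airflow versions before 2.4)
    catchup=False,      # do not backfill runs for past dates
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    transform = BashOperator(task_id="transform", bash_command="echo 'transforming data'")
    load = BashOperator(task_id="load", bash_command="echo 'loading data'")

    # Dependencies: extract runs first, then transform, then load.
    extract >> transform >> load

The >> operator declares that extract must finish before transform starts, and transform before load; Airflow resolves these dependencies and schedules each task in the correct order.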

In general, a DAG represents a workflow or process such as:

  • ETL pipeline: Extract data from different databases, transform it into a unified format, and load it into a data warehouse for further analysis.

  • Machine learning pipeline: Perform the entire process of model training, evaluation, and deployment. ...