Deploying Airflow
Learn about Apache Airflow’s components and how to deploy it.
Deploying Airflow involves setting up the environment to run, schedule, and monitor data pipelines. Airflow’s environment consists of several components that work together to manage and execute workflows efficiently.
Airflow’s components
Database: Airflow requires a database to store its metadata, including information about DAGs, tasks, task instances, and their statuses. The metadata database is essential for maintaining the state of running and completed tasks. The default database (used for development) is SQLite. Consider using a more robust database like PostgreSQL, MySQL, or Microsoft SQL Server for production.
Airflow webserver: The webserver provides a web-based user interface for interacting with and monitoring workflows. Through a web browser, users can view DAGs, check task statuses, and manually trigger DAG runs.
Scheduler: The Airflow scheduler is responsible for triggering task instances based on the defined schedules and dependencies. It periodically queries the metadata database to determine which tasks need to be executed and when.
Workers: Airflow workers are responsible for executing task instances. The scheduler assigns tasks to the workers, which run them in separate processes or containers. The commands typically used to start these components are sketched after this list.
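As a rough sketch of how these components run, each one is started with its own command in the Airflow 2.x CLI. The exact commands depend on the Airflow version, and the worker command only applies when a distributed executor such as the CeleryExecutor is configured:

```bash
# Initialize the metadata database (stores DAGs, tasks, and task instance state);
# newer Airflow versions use "airflow db migrate" instead
airflow db init

# Start the webserver (UI), listening on port 8080 by default
airflow webserver --port 8080

# Start the scheduler, which triggers task instances based on schedules and dependencies
airflow scheduler

# Start a Celery worker to execute tasks (only needed with the CeleryExecutor;
# with the default executors, tasks run alongside the scheduler)
airflow celery worker
```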
We can deploy all of these components on a single node or distribute them across several. Deploying Airflow on a single node means that all of Airflow’s components (database, webserver, scheduler, and workers) run on one machine. As the workload increases, we might run into performance limitations.
When that happens, we should consider distributing the components across multiple machines and adding more workers. Distributing the components adds a layer of fault tolerance by loosely coupling them: if a single component goes down, the rest of the system can remain active.
Additionally, distributing the components lets us scale the workers to support increased loads and execute pipelines in parallel. We should weigh our use case and performance requirements before deciding on the appropriate deployment mode for Apache Airflow.
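As an illustration, a distributed deployment is often configured by switching to a distributed executor such as the CeleryExecutor and pointing every component at the same metadata database and message broker. The hostnames and credentials below are placeholders, and the exact setting names can vary with the Airflow version (older releases use AIRFLOW__CORE__SQL_ALCHEMY_CONN for the database connection):

```bash
# Hypothetical configuration for a distributed deployment (placeholder hosts and credentials).
# Each value can also be set in airflow.cfg instead of environment variables.

# Use a distributed executor so tasks can run on separate worker machines
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor

# Shared metadata database (PostgreSQL instead of the default SQLite)
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@db-host:5432/airflow

# Message broker and result backend used by the Celery workers
export AIRFLOW__CELERY__BROKER_URL=redis://broker-host:6379/0
export AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@db-host:5432/airflow
```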
Example
Let’s look at how to install and deploy all Airflow components on a single node. First, we’ll use Python’s package manager, pip, to install the packages needed to run Airflow. The exact versions of these packages are pinned in a constraints file provided by the Airflow community.
All we have to do is grab our current Python version, specify the Airflow version we want to install, and add both to the constraints URL, as shown below.
Note: We’ve already preconfigured Apache Airflow in this environment, so there’s no need to execute these commands.
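A sketch of the installation commands follows; the Airflow version below is only an example, so substitute the version you want to install:

```bash
# Choose the Airflow version to install and detect the local Python version (e.g., 3.8)
AIRFLOW_VERSION=2.7.3
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"

# Build the constraints URL published by the Airflow community for this combination
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

# Install Airflow with the dependency versions pinned in the constraints file
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
```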