Parallelism
This lesson explains the various Airflow settings that can affect the parallelism for a DAG and the Airflow cluster.
In this lesson, we’ll examine the various aspects that affect parallelism at the DAG level and across the entire system.
Parallelism
The parallelism configuration parameter limits the number of tasks that can be actively executing at the same time across the entire system. By default, this value is set to 32 in airflow.cfg.
Pool
Pools are one way to limit the number of tasks that run at any given time. Airflow provides one pool by default, named default_pool, which consists of 128 slots. Each slot can be used to run one task. We can change the number of slots for a given pool and also create new pools, e.g., using the UI. A task is assigned to a specific pool with the pool parameter.
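# Assumes BashOperator is imported and `dag` is a DAG object defined earlier in the file.
# The pool named here must already exist, e.g., created through the UI.
task = BashOperator(
    task_id="create_file",
    bash_command="touch ~/Greet.txt",
    pool="MySpecialPool",  # Pool assignment
    dag=dag,
)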
If the default pool has only one slot, a misbehaving task that never ends can starve every other task in that pool. Pools decide which tasks run first based on the priority_weight parameter. Tasks are scheduled for execution until all available slots are filled. Once capacity is reached, the remaining runnable tasks are queued, and their state is reflected in the UI. As slots free up, queued tasks start running based on the priority_weight of the task and its descendants.
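The queueing behavior described above can be sketched with a small simulation. This is only an illustration of the idea, not Airflow's scheduler code: the function name schedule, the task names, and the weights are all hypothetical, and each weight stands in for the task's effective priority.

```python
import heapq


def schedule(tasks, slots):
    """Split runnable tasks into (running, queued) for a pool with `slots` slots.

    `tasks` is a list of (task_id, priority_weight) pairs.
    Tasks with a higher weight are picked first.
    """
    # heapq is a min-heap, so weights are negated to pop the highest first.
    # The index breaks ties in favor of the task that was listed earlier.
    heap = [(-weight, index, task_id) for index, (task_id, weight) in enumerate(tasks)]
    heapq.heapify(heap)

    running = []
    while heap and len(running) < slots:
        _, _, task_id = heapq.heappop(heap)
        running.append(task_id)

    # Whatever is left waits in the queue, highest priority first.
    queued = [task_id for _, _, task_id in sorted(heap)]
    return running, queued


# Hypothetical tasks competing for a pool with two slots.
tasks = [("load_users", 1), ("call_api", 5), ("refresh_cache", 3), ("send_report", 2)]
running, queued = schedule(tasks, slots=2)
print(running)  # ['call_api', 'refresh_cache']
print(queued)   # ['send_report', 'load_users']
```

With two slots, the two highest-weight tasks run and the other two wait. When a slot frees up, send_report would be the next task to start.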
The idea behind pools is to limit tasks of a particular type from running at the same time. For instance, consider tasks from disparate DAGs that attempt to establish connections to a database or make API calls. We would want to throttle or limit such tasks so as not to overwhelm the systems they connect with.
dag_concurrency
The dag_concurrency parameter, also defined in the airflow.cfg configuration file, determines how many tasks of a given DAG can execute simultaneously. By default, this is set to 16. The limit applies to each DAG separately.
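To show where these two settings live in airflow.cfg, here is a minimal sketch that parses a sample fragment with Python's standard configparser module. Airflow reads the file through its own configuration layer, so this is only a way to inspect the layout. The values in the fragment are illustrative and are not Airflow's defaults.

```python
import configparser

# Illustrative fragment of airflow.cfg; the values are made up for this example.
SAMPLE_CFG = """
[core]
parallelism = 8
dag_concurrency = 4
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)

# Both settings sit in the [core] section of the file.
parallelism = parser.getint("core", "parallelism")
dag_concurrency = parser.getint("core", "dag_concurrency")

print(parallelism, dag_concurrency)  # 8 4
```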
worker_concurrency
The worker_concurrency configuration parameter determines how many tasks a single worker can execute at the same time. This parameter only affects the Celery executor. Increasing worker_concurrency may require providing more resources to the workers to handle the load.
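As a rough piece of arithmetic, the number of tasks a Celery deployment can run at once is the number of workers multiplied by worker_concurrency. The sketch below assumes every worker uses the same setting. The function name and the worker count are hypothetical.

```python
def celery_task_capacity(num_workers: int, worker_concurrency: int) -> int:
    """Upper bound on tasks running at once across identical Celery workers."""
    return num_workers * worker_concurrency


# Three hypothetical workers, each allowed to run 16 tasks at a time.
print(celery_task_capacity(num_workers=3, worker_concurrency=16))  # 48
```

This is only the capacity of the workers. The system-wide parallelism setting described earlier still caps how many tasks actually run.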
max_threads
The max_threads parameter affects the scheduler component of Airflow. As other parameters are tweaked to increase concurrency in the system, the additional load can slow down the scheduler. The max_threads parameter sets the maximum number of threads the scheduler is allowed to run. Increasing it can prevent the scheduler from falling behind, delaying task execution, or showing gaps in the chart displayed in the Airflow UI, but it may also require a corresponding increase in compute and memory resources.
Relationship
The knobs and levers Airflow exposes to control parallelism in the system can be confusing, but they are all related to each other, and the most restrictive setting wins. For instance, if we set the parallelism parameter to 2 and dag_concurrency to 10, we'll only be able to run two tasks simultaneously for a given DAG. Similarly, if we keep dag_concurrency at 10 but set parallelism to 5, we'll only be able to execute five tasks in parallel for a DAG, even though the DAG itself is allowed to run ten tasks concurrently.
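The two cases above reduce to taking the smaller of the two settings. Here is a minimal sketch with a hypothetical helper name. It ignores pools and worker capacity, which can lower the number further.

```python
def max_parallel_tasks_for_dag(parallelism: int, dag_concurrency: int) -> int:
    """Tasks of one DAG that can run at once, given only these two settings."""
    # The system-wide limit and the per-DAG limit both apply, so the smaller wins.
    return min(parallelism, dag_concurrency)


print(max_parallel_tasks_for_dag(parallelism=2, dag_concurrency=10))  # 2
print(max_parallel_tasks_for_dag(parallelism=5, dag_concurrency=10))  # 5
```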