Parallelism

This lesson examines the Airflow settings that affect parallelism, both for an individual DAG and across the entire Airflow cluster.

Parallelism

The parallelism configuration parameter caps the number of task instances that can be running at the same time across the entire system. It is set in the [core] section of airflow.cfg and defaults to 32.

Pool

Pools are one way to limit the number of tasks that run at any given time. One pool exists by default: it is named default_pool and has 128 slots. Each slot can run one task, and a task that is not assigned to a pool runs in default_pool. We can change the number of slots for a pool and create new pools, for example from the Admin > Pools menu in the UI. A task is assigned to a specific pool with the pool parameter.

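# Airflow 1.10 import path; in Airflow 2 use airflow.operators.bash instead.
from airflow.operators.bash_operator import BashOperator

# Assumes `dag` is defined earlier and a pool named MySpecialPool already exists.
task = BashOperator(
    task_id="create_file",
    bash_command="touch ~/Greet.txt",
    pool="MySpecialPool",  # pool assignment
    dag=dag,
)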

If the default pool is configured with a single slot, a misbehaving task that never ends can starve every other task in that pool. Pools decide which waiting task runs next based on the priority_weight parameter. Tasks are scheduled for execution until all available slots fill up. Once capacity is reached, the remaining runnable tasks are queued, and the UI shows them in the queued state. As slots free up, queued tasks start running in order of priority, which is computed from the priority_weight of the task and its descendants.
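The queueing behavior described above can be sketched in plain Python. This is a simplified model, not Airflow's scheduler: the function name run_order is hypothetical, every task is assumed to occupy one slot, all running tasks are assumed to finish together, and each weight is assumed to be the task's already-computed effective priority.

```python
import heapq


def run_order(tasks, slots):
    """Return batches of task ids in the order a pool would start them.

    tasks: list of (task_id, effective_priority_weight) pairs.
    slots: number of slots in the pool (must be at least 1).
    Simplification: each task takes one slot and a whole batch finishes
    before the next batch starts.
    """
    if slots < 1:
        raise ValueError("a pool needs at least one slot to run anything")

    # heapq is a min-heap, so negate the weight to pop the highest first.
    # The index breaks ties in favor of the task listed earlier.
    heap = [(-weight, index, task_id)
            for index, (task_id, weight) in enumerate(tasks)]
    heapq.heapify(heap)

    batches = []
    while heap:
        batch_size = min(slots, len(heap))
        batches.append([heapq.heappop(heap)[2] for _ in range(batch_size)])
    return batches


queued = [("extract", 1), ("load_db", 5), ("notify", 1), ("report", 3)]
print(run_order(queued, slots=2))
```

With two slots, the two highest-weight tasks start first and the rest wait in the queue. With one slot, the tasks run strictly one after another, which is why a single task that never finishes blocks everything behind it.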

The idea behind pools is to limit how many tasks of a particular type run at the same time. For instance, consider tasks from disparate DAGs that all open connections to the same database or call the same API. We would want to throttle such tasks so as not to overwhelm the systems they connect to.

dag_concurrency

The dag_concurrency parameter, also defined in the airflow.cfg configuration file, determines how many tasks of a given DAG can execute simultaneously. By default, it is set to 16. The limit applies to each DAG separately. Airflow 2.2 renamed this setting to max_active_tasks_per_dag.

worker_concurrency

The worker_concurrency configuration parameter determines how many tasks a single worker can execute at the same time. It only affects the Celery executor. Increasing worker_concurrency may require giving the workers more resources to handle the load.
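As a rough sizing aid, the arithmetic behind worker_concurrency can be written out. This is a back-of-the-envelope sketch: the function names are hypothetical, and it assumes every Celery worker uses the same worker_concurrency value.

```python
import math


def celery_capacity(num_workers, worker_concurrency):
    """Total tasks a Celery cluster can run at once (identical workers assumed)."""
    return num_workers * worker_concurrency


def workers_needed(target_tasks, worker_concurrency):
    """Smallest number of workers whose combined capacity reaches target_tasks."""
    return math.ceil(target_tasks / worker_concurrency)


print(celery_capacity(num_workers=3, worker_concurrency=16))
print(workers_needed(target_tasks=40, worker_concurrency=16))
```

The result is only an upper bound on worker capacity. The other limits in this lesson can still hold the number of running tasks below it.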

max_threads

The max_threads parameter affects the scheduler component of Airflow. As other parameters are raised to increase concurrency in the system, the additional load can slow the scheduler down. max_threads sets the maximum number of threads the scheduler is allowed to run. Raising it can keep the scheduler from falling behind, which would otherwise delay task execution or show up as gaps in the Gantt chart in the Airflow UI, but it may also require a corresponding increase in compute and memory resources. Airflow 2 renamed this setting to parsing_processes.
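To see where these settings live, the sketch below parses a small airflow.cfg-style fragment with Python's standard configparser module. The section names follow the Airflow 1.10 layout; the values and the helper name read_concurrency_settings are illustrative assumptions, not recommendations.

```python
import configparser

# Illustrative fragment in the style of airflow.cfg (Airflow 1.10 setting names).
SAMPLE_CFG = """
[core]
parallelism = 32
dag_concurrency = 16

[celery]
worker_concurrency = 16

[scheduler]
max_threads = 2
"""


def read_concurrency_settings(cfg_text):
    """Parse the concurrency-related settings out of airflow.cfg-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        "parallelism": parser.getint("core", "parallelism"),
        "dag_concurrency": parser.getint("core", "dag_concurrency"),
        "worker_concurrency": parser.getint("celery", "worker_concurrency"),
        "max_threads": parser.getint("scheduler", "max_threads"),
    }


for name, value in read_concurrency_settings(SAMPLE_CFG).items():
    print(f"{name} = {value}")
```

A real airflow.cfg contains many more options; this fragment keeps only the four discussed in this lesson.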

Relationship

The knobs and levers Airflow exposes to control parallelism can be confusing, but they are all related: the number of tasks that actually run is bounded by the smallest limit that applies. For instance, if we set the parallelism parameter to 2 and dag_concurrency to 10, we’ll only be able to run two tasks simultaneously for a given DAG. Similarly, if we keep dag_concurrency at 10 but set parallelism to 5, we’ll only be able to execute five tasks in parallel for a DAG, even though the DAG itself is allowed ten concurrent tasks.
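The relationship can be expressed as taking the minimum of the limits that apply. The function below is a simplified model with a hypothetical name, not an Airflow API; it assumes each task occupies one pool slot and ignores worker capacity.

```python
def max_running_tasks_for_dag(parallelism, dag_concurrency, pool_slots=None):
    """Upper bound on simultaneously running tasks for one DAG.

    parallelism: system-wide cap on running tasks.
    dag_concurrency: per-DAG cap on running tasks.
    pool_slots: optional size of the pool the DAG's tasks run in.
    """
    limits = [parallelism, dag_concurrency]
    if pool_slots is not None:
        limits.append(pool_slots)
    # The tightest limit decides how many tasks can run.
    return min(limits)


print(max_running_tasks_for_dag(parallelism=2, dag_concurrency=10))
print(max_running_tasks_for_dag(parallelism=5, dag_concurrency=10))
print(max_running_tasks_for_dag(parallelism=32, dag_concurrency=16, pool_slots=4))
```

The first two calls reproduce the examples in the text. The third shows a small pool becoming the bottleneck even when both configuration limits are generous.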
