Scheduling
This lesson clarifies how schedule_interval and start_date work together, which can be confusing, especially with complex crontab expressions.
When initially working with Airflow, it is common to be confused about how all the scheduling parameters interact. In this lesson, we’ll explore the differences between the various parameters.
When creating a DAG, we can specify the start_date and the schedule_interval as parameters to the constructor. Let’s see an example:
from datetime import datetime
from airflow import DAG

# default_args is assumed to be defined earlier in the script
dag = DAG(
    'Example9',
    default_args=default_args,
    description='Example DAG 9',
    schedule_interval='@daily',
    start_date=datetime(2020, 9, 5))
In the Example9 DAG, we set start_date to 5th Sept 2020, and the schedule_interval is set to @daily. Note that @daily is an alias for the 0 0 * * * crontab expression. There are other aliases for commonly used schedules, such as @weekly, @monthly, and @yearly, which all translate to crontab expressions under the hood. You can provide a crontab expression for the schedule_interval parameter for complex schedules. A good resource for working with crontab expressions is crontab.guru. Remember that Airflow works with UTC by default but can be configured to work with your local time too.

Airflow will also schedule DAG runs for the previous days even though we are running the DAG now. Each DAG run is associated with an execution date and a start date. The execution date is the date the DAG should have run for, and the start date is when Airflow actually runs it. Please don’t confuse the start_date that we pass into the DAG constructor with the start date associated with a DAG run; the two are distinct.
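To make the execution date concrete, here is a small illustrative task (a sketch, not part of the original example; it assumes the Airflow 1.x import path). Airflow exposes the execution date to tasks through template variables such as {{ ds }}, so attaching a task like this to the DAG above would print each run’s execution date.

from airflow.operators.bash_operator import BashOperator

# Illustrative sketch: {{ ds }} is an Airflow template variable that
# resolves to the DAG run's execution date (YYYY-MM-DD).
print_execution_date = BashOperator(
    task_id='print_execution_date',
    bash_command='echo "execution date: {{ ds }}"',
    dag=dag)

For a backfilled run, {{ ds }} shows the day the run is for, not the day it actually started.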
The combination of start_date and schedule_interval implies that the DAG Example9 should have run starting from 5th Sept 2020 up until now. Let’s further assume that now is 9th Sept 2020, i.e., you are running the DAG on the 9th of Sept 2020. Airflow is smart enough to run the missing DAG runs for the 5th, 6th, 7th, and 8th of Sept for you too. All four of these runs will have a start date of Sept. 9th (since all of them were kicked off on Sept. 9th) but will have different timestamps on the 9th. Additionally, the execution dates for these DAG runs will span from Sept. 5th to Sept. 8th.
As you can see from the screenshot above, Airflow created DAG runs for each of the prior days. The execution date for each DAG run is distinct, but the start dates for all of the runs are the 9th of Sept. Sure, the hours, minutes, etc., differ because the runs were kicked off at different times on the 9th of Sept.

The astute reader will observe that the run for the 9th of Sept. is missing from the list in the screenshot above. This is because Airflow has its roots in ETL, which involves running batch jobs at the end of the day for that day. For instance, the data for the 4th of July is collected for the entire day, and the ETL job for the 4th of July actually runs at 12 a.m. on the 5th of July. This makes sense because the job for a day should wait until all the data for that day is available before running. In our example, the DAG run for the 9th of Sept. isn’t executed until 12 a.m. on the 10th of Sept., which is why it doesn’t show up in the listing in the screenshot.

Along the same lines, if the schedule for a DAG is 3 p.m. every day, then the DAG run for a given day will run immediately after 3 p.m. the next day (and not at 12 a.m.). There is a full 24-hour delay before the DAG run for the previous day executes.
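For instance, a hypothetical DAG with a 3 p.m. schedule might be declared as follows (a sketch; the DAG id and variable name are illustrative):

from datetime import datetime
from airflow import DAG

# '0 15 * * *' means 3 p.m. every day (UTC by default).
# The run with execution date Sept 9th, 3 p.m. is not kicked off
# until Sept 10th at 3 p.m., once its 24-hour interval has closed.
afternoon_dag = DAG(
    'ExampleAfternoon',
    schedule_interval='0 15 * * *',
    start_date=datetime(2020, 9, 5))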
It is interesting to consider what happens if you set the schedule_interval to run every Monday and Friday. Let’s say we set the start_date for the DAG as August 15th, 2020.
Day | Date | DAG Execution Date | DAG Start Date |
---|---|---|---|
Sunday | Aug. 15th, 2020 | - | - |
Monday | Aug. 16th, 2020 | - | - |
Tuesday | Aug. 17th, 2020 | - | - |
Wednesday | Aug. 18th, 2020 | - | - |
Thursday | Aug. 19th, 2020 | - | - |
Friday | Aug. 20th, 2020 | Aug. 16th, 2020 | Aug. 20th, 2020 |
Saturday | Aug. 21st, 2020 | - | - |
Sunday | Aug. 22nd, 2020 | - | - |
Monday | Aug. 23rd, 2020 | Aug. 20th, 2020 | Aug. 23rd, 2020 |
Tuesday | Aug. 24th, 2020 | - | - |
Wednesday | Aug. 25th, 2020 | - | - |
Thursday | Aug. 26th, 2020 | - | - |
Friday | Aug. 27th, 2020 | Aug. 23rd, 2020 | Aug. 27th, 2020 |
Saturday | Aug. 28th, 2020 | - | - |
The DAG’s start date is set to Aug. 15th, 2020 (not to be confused with a DAG run’s start date, as shown in the table above). The first DAG run is for Aug. 16th, 2020, but it’ll start on Friday, Aug. 20th, 2020! This is an idiosyncrasy of Airflow that can confuse even seasoned engineers. The DAG run for Aug. 16th, 2020 actually runs on Aug. 20th, 2020; the execution date for the run is Aug. 16th, 2020, but its start date is Aug. 20th, 2020. The DAG run with the execution date of Aug. 20th, 2020 will have a start date of the next Monday, i.e., Aug. 23rd, 2020, and so on and so forth.
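Expressed in code, a hypothetical DAG with this schedule might look like the following sketch (the DAG id and variable name are illustrative; the crontab expression 0 0 * * 1,5 means midnight on Mondays and Fridays):

from datetime import datetime
from airflow import DAG

# Runs at midnight every Monday and Friday (1 = Monday, 5 = Friday).
mon_fri_dag = DAG(
    'ExampleMonFri',
    schedule_interval='0 0 * * 1,5',
    start_date=datetime(2020, 8, 15))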
Finally, you may note that a DAG doesn’t execute at exactly the time it is supposed to run, e.g., a DAG run may start at 10:01 p.m. when its schedule asks it to run at 10:00 p.m. There may be a delay of a few seconds or so. There’s a configuration parameter, scheduler_heartbeat_sec, defined in airflow.cfg that controls how often the Airflow scheduler runs. The scheduler periodically looks for tasks to trigger, so there may be a delay between when a task becomes due and when the scheduler is able to run it. Making the scheduler run at a higher frequency can put pressure on the database, so any tweaks should be done cautiously.