Introduction
This lesson introduces you to Apache Airflow.
Apache Airflow is an open-source workflow management platform. It started at Airbnb in October 2014 as a solution to manage the company’s increasingly complex workflows. Creating Airflow allowed Airbnb to programmatically author and schedule their workflows and monitor them via the built-in Airflow user interface.
Airflow can be described as a platform for defining, executing, and monitoring workflows. A workflow is any sequence of steps taken to accomplish a particular goal. Imagine that at your company, a job copies log data from machines and uploads it to an S3 bucket. A second, MapReduce-based job reads the log data from that S3 bucket, detects anomalies (e.g., too many logins), and writes them out to an HDFS location. Finally, a third job reads the output of the second job and inserts it into a relational database. Such pipelines are common at enterprises, and one of the challenges growing Big Data teams face is stitching related jobs together into an end-to-end workflow like this one, sketched in code below the figure. Before Airflow existed, the tool of choice for describing workflows was Oozie, but it came with its own limitations, and over the years Airflow has overtaken Oozie in popularity for creating complex workflows.
Pipeline example
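To make this concrete, here is a minimal sketch of how such a three-job pipeline might be expressed as an Airflow DAG. The DAG ID, task IDs, shell commands, and paths are placeholder assumptions, not real jobs; in practice each step might use a dedicated S3, Spark, or database operator instead of `BashOperator`:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal sketch of the log-processing pipeline described above.
# All commands and paths below are hypothetical placeholders.
with DAG(
    dag_id="log_anomaly_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Job 1: copy log data from machines and upload it to an S3 bucket
    upload_logs = BashOperator(
        task_id="upload_logs_to_s3",
        bash_command="copy_logs.sh s3://example-bucket/logs/",
    )

    # Job 2: MapReduce job that reads from S3, detects anomalies,
    # and writes them to an HDFS location
    detect_anomalies = BashOperator(
        task_id="detect_anomalies",
        bash_command="hadoop jar anomaly-detector.jar "
                     "s3://example-bucket/logs/ hdfs:///anomalies/",
    )

    # Job 3: load the anomalies from HDFS into a relational database
    load_to_db = BashOperator(
        task_id="load_anomalies_to_db",
        bash_command="load_to_db.sh hdfs:///anomalies/",
    )

    # Declare the end-to-end ordering: upload -> detect -> load
    upload_logs >> detect_anomalies >> load_to_db
```

The `>>` operator declares task dependencies, so Airflow knows to run the three jobs in order and can retry or alert on whichever step fails.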
Differences between Apache Airflow and other workflow management systems
Here are some of the differences between Airflow and other Big Data workflow management platforms, such as Oozie:
- DAGs (Directed Acyclic Graphs) are written in Python, which has a gentler learning curve and is more accessible to less technically savvy users than Java, which Oozie uses.
- Airflow has a huge community contributing to it, which makes it easy to find integrations for every major service and cloud provider.
- Airflow is more versatile and expressive, and it is capable of creating extremely complex workflows. It also provides advanced metrics on workflows.
- Airflow's API is richer, and its UI is widely considered better than that of most other workflow management systems.
- One of Airflow's key differentiating features is templating, which replaces variables or expressions with concrete values when a template is rendered. This enables use cases such as referencing a filename that is unique to the date of the DAG run. Jinja is the Python template engine Airflow uses to provide pipeline authors with a set of built-in parameters and macros; a short sketch follows this list.
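As an illustration, here is a minimal, hypothetical sketch of templating in a task definition. `{{ ds }}` is one of Airflow's built-in template variables and renders to the logical date of the DAG run in `YYYY-MM-DD` form; the DAG ID, task ID, and file paths are made up for this example:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="templating_example",  # hypothetical DAG ID
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    archive_logs = BashOperator(
        task_id="archive_logs",
        # Jinja renders {{ ds }} when the task runs, producing a
        # date-unique filename such as "logs-2023-01-01.txt"
        bash_command="cp /var/log/app.log /archive/logs-{{ ds }}.txt",
    )
```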
As of this writing, managed Airflow cloud services have also been introduced, such as Google Cloud Composer and Astronomer.io.
Example
Airflow workflows are written in Python, which has contributed to Airflow's broad adoption, since Python is such a widely used language. Let's work through the example below to demonstrate how simply and flexibly pipelines and workflows can be set up with Airflow.
# Click on the button "Click to launch app!" in the widget below.
# After a while, you'll see Airflow's webserver UI load in the widget.
# Click on the arrow at the right end of the widget, and the UI will open
# up in a new tab. You'll see a list of DAGs. A DAG stands for Directed
# Acyclic Graph, and you can think of it as a blueprint for a workflow.
# The first DAG in the list will be "example_bash_operator". You can click
# the button to its left to enable it. After a while, refresh the browser
# and observe the workflow example_bash_operator run to completion. You
# should see an entry in the "Last Run" column.
# Congratulations! You just ran your first Airflow DAG workflow :)
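If you're running Airflow on your own machine instead of the hosted widget, the equivalent steps can be done from the command line. This assumes Airflow 2.x with the bundled example DAGs loaded (the default configuration):

```
# Enable (unpause) the example DAG, then trigger a run of it
airflow dags unpause example_bash_operator
airflow dags trigger example_bash_operator

# Inspect recent runs of the DAG and their states
airflow dags list-runs -d example_bash_operator
```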