ETL Pipeline Stages
Learn about the different stages of an ETL pipeline, from extraction to loading and scheduling.
Extract
The first stage of any ETL pipeline is the extract step, where we gather and extract the data. Data used for analytics can be found in both external and internal sources.
A company can choose to extract data from external sources such as Internet of Things (IoT) sensor devices, social media platforms, public REST APIs, statistical data from government websites, web scraping, and many more. Alternatively, a company can extract data from internal sources, such as its production database, its data warehouse, or its own applications and services via REST APIs.
Choosing which source to pull data from depends on the use case. For instance, a company that wants to analyze its user activity won't need an external data source; it'll pull activity log data from its internal systems.
Companies can also combine data from multiple external or internal sources, arriving in multiple formats such as CSV, TSV, JSON, and XML. Reconciling all of this raw data is why the next stage is called transform.
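To make the extract step more concrete, here is a minimal sketch in Python that pulls data from an external REST API and an internal CSV export. The endpoint URL, file path, and function names are illustrative assumptions, not part of any specific system.

```python
import requests
import pandas as pd

def extract_from_api(url: str) -> list[dict]:
    """Extract raw records from an external REST API (hypothetical endpoint)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()   # fail fast on HTTP errors
    return response.json()        # assume the API returns a JSON array of records

def extract_from_csv(path: str) -> pd.DataFrame:
    """Extract records from an internal CSV export, e.g. an activity log."""
    return pd.read_csv(path)

# Illustrative sources -- replace with your own API endpoint and file path.
api_records = extract_from_api("https://api.example.com/v1/orders")
log_df = extract_from_csv("exports/user_activity.csv")
```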
Transform
Before loading the data, we usually have to process it in one way or another. Processing the data can accomplish many goals.
First, to be loaded, the data needs to be in a format that the destination can understand. For example, before loading data into a relational database, it must be consistent, tabular, and structured, free of missing values and duplicates, and it must conform to the schema of the destination tables.
We might also need to change the data format before loading it. For example, we might extract data in XML format and transform it into JSON so that an application like Tableau can use it. The data must also meet the requirements of the business. For example, data used for financial auditing and end-of-month BI reporting must satisfy specific auditing demands, which means we need to transform the data to meet those requirements.
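As a small illustration of such a format change, the sketch below converts an assumed XML export of orders into JSON using only Python's standard library; the element names (orders, order, id, amount) are invented for the example.

```python
import json
import xml.etree.ElementTree as ET

def xml_orders_to_json(xml_text: str) -> str:
    """Convert an assumed <orders><order>...</order></orders> document to JSON."""
    root = ET.fromstring(xml_text)
    records = [
        {child.tag: child.text for child in order}   # each child element becomes a key/value pair
        for order in root.findall("order")
    ]
    return json.dumps(records, indent=2)

sample = "<orders><order><id>1</id><amount>9.99</amount></order></orders>"
print(xml_orders_to_json(sample))   # -> [{"id": "1", "amount": "9.99"}]
```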
Transformation can also be used to aggregate data and derive newly calculated values, eliminate useless data, resolve inconsistent values (such as phone numbers with and without dashes), and handle incomplete, missing, or duplicate rows.
Remember that we gathered data from internal and external sources in the extract step. To load everything into the same destination, all of this data must end up with the properties and format that both the business requirements and the final storage destination demand.
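Here is a minimal pandas sketch of these kinds of transformations on an assumed customer-orders extract; the column names and values are made up for illustration.

```python
import pandas as pd

# Assumed raw extract with the kinds of issues mentioned above:
# duplicate rows, missing values, and inconsistently formatted phone numbers.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "phone":       ["555-123-4567", "555-123-4567", "5559876543", None],
    "amount":      [10.0, 10.0, 25.5, None],
})

transformed = (
    raw.drop_duplicates()                  # remove duplicate rows
       .dropna(subset=["amount"])          # drop rows missing a required value
       .assign(phone=lambda df: df["phone"].str.replace("-", "", regex=False))  # normalize phones
)

# Derive a newly calculated, aggregated value per customer.
totals = transformed.groupby("customer_id", as_index=False)["amount"].sum()
print(totals)
```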
Load
The last stage in the ETL pipeline is the load stage. After extracting and transforming the data, we load it into the destination so that various business users can use it for analytics, predictions, and extracting business insights.
The implementation of this step also depends on the tool we use to store the data and on the business logic. Data can be loaded fully in a single batch or incrementally in several batches. We might need to erase the previous data before loading a new batch, or we might load the new batch on top of the old data.
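The sketch below shows both loading strategies with pandas and SQLAlchemy, using a local SQLite file as a stand-in for the real destination; the table and column names are assumptions made for the example.

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative destination: a local SQLite database stands in for the real warehouse.
engine = create_engine("sqlite:///warehouse.db")

def full_load(df: pd.DataFrame, table: str) -> None:
    """Erase the previous data and load the new batch in its place."""
    df.to_sql(table, engine, if_exists="replace", index=False)

def incremental_load(df: pd.DataFrame, table: str) -> None:
    """Append the new batch on top of the data that is already there."""
    df.to_sql(table, engine, if_exists="append", index=False)

batch = pd.DataFrame({"customer_id": [1, 2], "amount": [20.0, 25.5]})
full_load(batch, "daily_totals")         # first run: replace everything
incremental_load(batch, "daily_totals")  # later runs: add new rows only
```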
We might even need to create additional ETL pipelines or triggers to process the data after it has been loaded, or we might store it as is and let other business users, such as data analysts and data scientists, handle it the way they see fit.
Scheduling
Depending on the use case, the entire ETL pipeline will likely be scheduled, monitored, and automated by a data orchestration solution like Apache Airflow, or by even simpler tools like plain Python scripts and cron.
For example, orchestrating an ETL pipeline that pulls the last hour's currency rates every hour for a forex investing company adds a lot of value: it allows users to break the process down into separate jobs and gives them much better control over it.
Orchestrating ETL pipelines is a good way to reduce human error and manual effort, save time, improve accuracy, and more.
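To tie this together, here is a minimal sketch of such an hourly pipeline as an Apache Airflow DAG (Airflow 2.x API). The DAG id, task names, and empty callables are placeholders, not a real forex integration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_rates():
    """Pull the last hour's currency rates from a (hypothetical) forex API."""
    ...

def transform_rates():
    """Clean and reshape the rates to match the warehouse schema."""
    ...

def load_rates():
    """Load the transformed rates into the destination table."""
    ...

with DAG(
    dag_id="hourly_forex_rates",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",   # run the whole pipeline once every hour
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_rates)
    transform = PythonOperator(task_id="transform", python_callable=transform_rates)
    load = PythonOperator(task_id="load", python_callable=load_rates)

    extract >> transform >> load   # each stage becomes a separately monitored job
```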