ETL Pipeline Exercise: Extracting Data
Let’s extract social media data from a PostgreSQL database.
A case study
Suppose we’re data engineers working for a digital company and we’re tasked with creating an ETL pipeline.
Our company, “Fakebook,” has created a social media application used by people worldwide. The application constantly generates data, which is stored and managed in the company’s production database.
The company wants to process and analyze the data collected by the application to generate insights and identify usage patterns. However, running these analyses directly on the production database would place a heavy load on it. For this reason, the company has decided to separate compute and storage for analytics and perform all analysis in a separate repository: the data warehouse.
Because of this, we’re tasked with creating and scheduling an ETL pipeline that transfers social media-related data from the company’s production database to the data warehouse.
According to the business requirements, the ETL pipeline should be scheduled to run every four hours and use incremental loading to transfer only the latest data generated by the application.
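The scenario doesn’t name an orchestration tool, but as a rough sketch, a scheduler such as Apache Airflow could trigger the pipeline on a four-hour cadence. The DAG name and the extract, transform, and load placeholders below are hypothetical and stand in for the steps developed in this exercise.

```python
# Hypothetical Airflow DAG illustrating a four-hour schedule for the pipeline.
# The callables are placeholders for the extract, transform, and load steps.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # pull the latest rows from the production database


def transform():
    ...  # clean and reshape the extracted data


def load():
    ...  # write the results into the data warehouse


with DAG(
    dag_id="fakebook_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 */4 * * *",  # run every four hours
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```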
Extract
The first task is to acquire the relevant data. According to our scenario, we need to extract the latest data from the production database. For this example, the production database comprises three tables: users, posts, and comments.
The users table stores information about social media users.
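To make the extract step concrete, here is a minimal sketch that pulls only the users rows created since the previous run, using psycopg2. The connection settings, the created_at watermark column, and the last_extracted_at value are assumptions for illustration; the scenario itself only names the tables.

```python
# Minimal incremental-extract sketch using psycopg2.
# Connection details and the created_at watermark column are assumed.
import psycopg2

last_extracted_at = "2024-01-01 00:00:00"  # watermark saved from the previous run

conn = psycopg2.connect(
    host="localhost",
    dbname="fakebook_production",
    user="etl_user",
    password="etl_password",
)

with conn, conn.cursor() as cur:
    # Pull only the users created since the last pipeline run.
    cur.execute(
        "SELECT * FROM users WHERE created_at > %s",
        (last_extracted_at,),
    )
    users_rows = cur.fetchall()

conn.close()

print(f"Extracted {len(users_rows)} new user rows")
```

The same pattern would apply to the posts and comments tables, each filtered by its own watermark so every four-hour run extracts only new data.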