ETL Pipeline Exercise: Extracting Data

Let’s extract social media data from a PostgreSQL database.

A case study

Suppose we’re data engineers working for a digital company and we’re tasked with creating an ETL pipeline.

Our company, “Fakebook,” has built a social media application that is used worldwide. The application constantly generates data, which is stored and managed in the company’s production database.

The company wants to process and analyze the data collected by the application to generate insights and identify usage patterns. However, running these analyses directly on the production database would put a heavy load on it. For this reason, the company has decided to separate computation from storage and perform all analysis in a separate repository called the data warehouse.

Because of that, we’re tasked with creating and scheduling an ETL pipeline that transfers social media data from our company’s production database to the data warehouse.

According to the business requirements, the ETL pipeline should run every four hours and use incremental loading to pick up only the data generated by the application since the previous run.
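
If the pipeline were orchestrated with a tool such as Apache Airflow (the choice of scheduler is an assumption here, not part of the stated requirements), the four-hour schedule could be sketched roughly like this:

```python
# Minimal scheduling sketch, assuming Apache Airflow is used as the orchestrator.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_etl():
    # Placeholder for the extract -> transform -> load steps covered in this exercise.
    pass


with DAG(
    dag_id="fakebook_social_media_etl",       # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(hours=4),     # run every four hours
    catchup=False,
) as dag:
    etl_task = PythonOperator(task_id="run_etl", python_callable=run_etl)
```

Any scheduler that supports a fixed interval (a plain cron entry such as `0 */4 * * *` would also work) satisfies the same requirement; the incremental-loading part is handled inside the extraction logic itself.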

Extract

The first task is to acquire the relevant data. According to our scenario, we need to extract the latest data from the production database. For this example, the production database comprises three tables: users, posts, and comments.

The users table stores information about social media users.
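
To make the extraction step concrete, here is a minimal sketch of pulling the latest rows from these tables with psycopg2. The connection parameters and the `updated_at` timestamp column used for the incremental filter are assumptions made for illustration; the actual schema may track changes differently.

```python
# Sketch of incremental extraction from the production PostgreSQL database.
# Connection details and the updated_at column are illustrative assumptions.
import psycopg2


def extract_table(table_name, last_run_time):
    """Fetch rows from table_name created or updated since the previous pipeline run."""
    conn = psycopg2.connect(
        host="localhost",        # hypothetical connection parameters
        dbname="production",
        user="etl_user",
        password="etl_password",
    )
    try:
        with conn.cursor() as cur:
            # table_name comes from a fixed, trusted list below, so it is safe
            # to interpolate; row values are passed as query parameters.
            cur.execute(
                f"SELECT * FROM {table_name} WHERE updated_at > %s;",
                (last_run_time,),
            )
            return cur.fetchall()
    finally:
        conn.close()


# Example usage: extract the latest rows from each of the three production tables.
# for table in ("users", "posts", "comments"):
#     rows = extract_table(table, last_run_time)
```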
