What is this course about?

Data engineering is a fast-growing tech field with extremely high demand. According to DICE’s 2022 Tech Job Report, data engineering was one of the most in-demand jobs in 2021, with job posting volume growing 42.2% year over year.

In the wake of the surge in data scientists, companies realized that they needed proper data infrastructure to perform data analysis and apply machine learning algorithms. They started to invest in modern data stacks and hire data engineers. Forbes has also reported that data preparation accounts for about 80% of a data scientist's work. Learning data engineering skills therefore helps data scientists do their daily jobs more efficiently.

This course teaches the foundations of data engineering through practical theory and hands-on coding projects. It builds a solid base for solving real-world data engineering problems. Upon completing the course, you'll understand the following:

What comprises a data team? How should a data team be structured?
What is a data engineering life cycle?
What are the different types of cloud data architecture? What is a well-designed data architecture?
What are the different types of data ingestion?
What are the steps to create dimensional modeling?
How can we transform data in SQL and Python?
How can we orchestrate data pipelines? What are the tools?
How can we ensure data quality?

Who should take this course?

The primary audiences of this course are data engineers, data scientists, and machine learning engineers. It is also suitable for entry-level and intermediate-level data engineers who want to consolidate their data engineering knowledge and prepare for a data engineer interview. That said, anyone who works in the broader data domain is welcome to take this course.

Structure of the course

The course starts with the theory of data engineering, including data team structure, the data engineering life cycle, and cloud data architecture. Each component of the data engineering life cycle is then explored in depth with many coding examples. The course concludes with a data pipeline project built from scratch. The outline of the course is listed below:

Data team structure
Data engineering life cycle
Cloud data architecture
Data ingestion
Data modeling
Data orchestration
Data quality
Building an end-to-end data pipeline

Required resources

The course provides a built-in environment to run all the coding examples, though some of them must run in the cloud. Modern data engineering relies heavily on cloud providers such as AWS, Google Cloud, and Azure. In this course, we will try out different cloud services on the Google Cloud Platform (GCP).

Follow the instructions in the Appendix chapter to create an account. GCP offers a free trial of various products. One of the free products is the BigQuery sandbox, a risk-free environment in the BigQuery data warehouse where we can explore several public datasets. We will use it extensively throughout the course.

Note: After the free trial ends, don't forget to monitor your costs. Use the billing dashboard to understand where the money is spent, and shut down unnecessary services to avoid unexpected bills.

So, let’s get started!
