PROJECT
Build ETL Pipeline in GCP: Load GCS Data to BigQuery via Dataflow
In this project, we’ll learn to create an extract, transform, load (ETL) pipeline using Google Cloud Platform (GCP) services. We’ll extract data from Cloud Storage buckets, clean and transform it using Apache Beam and Dataflow, and finally load it into BigQuery.
You will learn to:
Build and implement an ETL pipeline using various GCP tools.
Apply data validation, cleaning, transformation, and aggregation techniques using Apache Beam (Python).
Create a Google Cloud Storage bucket, BigQuery dataset, table, and view.
Use a GCP service account for proper pipeline access control.
Monitor the activities of a running Dataflow ETL pipeline.
Execute various GCP command-line interface (CLI) commands.
Skills
Data Pipeline Engineering
Data Extraction
Data Analysis
Data Manipulation
Data Cleaning
Prerequisites
Good understanding of Python
Good understanding of BigQuery
Good understanding of Apache Beam
Good understanding of Google Cloud services
Technologies
GCP
Python
BigQuery
Apache Beam
Project Description
In this project, we’ll design a Google Cloud Dataflow ETL pipeline to extract raw Airbnb data from a Cloud Storage bucket. After performing various data cleaning and transformation tasks, we’ll load it into BigQuery tables and views for analytical purposes.
Technically, the pipeline will be written with Apache Beam’s Python software development kit (SDK) and will follow these steps (a minimal sketch appears after the list):
Ingest data from a Google Cloud Storage bucket.
Validate the ingested data with various integrity checks for values that are missing, null, or outside the prescribed range.
Clean the data to remove unnecessary columns, special characters and symbols, and duplications.
Transform and enrich the data using various techniques like filtering, grouping, and aggregation.
Load the transformed data to BigQuery tables and views.
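To make these steps concrete, here is a minimal, self-contained sketch of how such a pipeline can be wired together with the Beam Python SDK. The bucket path, table name, schema, and field names (id, neighbourhood, price) are illustrative placeholders rather than the project’s actual values.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_csv_line(line):
    """Split a raw CSV line into a dict of named fields (simplified; real
    code would use the csv module to handle quoted commas)."""
    fields = line.split(",") + ["", "", ""]  # pad so short lines fail validation
    return {"id": fields[0], "neighbourhood": fields[1], "price": fields[2]}


def is_valid(record):
    """Integrity check: required fields are present and price is numeric."""
    return bool(record["id"]) and record["price"].replace(".", "", 1).isdigit()


def clean_record(record):
    """Strip stray characters and cast types."""
    return {
        "id": record["id"].strip(),
        "neighbourhood": record["neighbourhood"].strip().title(),
        "price": float(record["price"]),
    }


with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        # 1. Ingest: read raw CSV lines from the Cloud Storage bucket.
        | "Read from GCS" >> beam.io.ReadFromText(
            "gs://my-airbnb-bucket/listings.csv", skip_header_lines=1)
        # Part of cleaning: drop duplicate lines early.
        | "Deduplicate" >> beam.Distinct()
        # 2. Validate: parse each line and keep only records that pass checks.
        | "Parse" >> beam.Map(parse_csv_line)
        | "Validate" >> beam.Filter(is_valid)
        # 3. Clean and 4. transform the surviving records.
        | "Clean" >> beam.Map(clean_record)
        # 5. Load: write the final records to a BigQuery table.
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:airbnb_dataset.listings",
            schema="id:STRING,neighbourhood:STRING,price:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```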
Appropriate access control and security mechanisms will be implemented to run the ETL pipeline, using GCP identity and access management (IAM) for authentication and authorization. Mirroring real-world practice, we’ll create and run the pipeline through a service account rather than a personal user account.
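As a hedged sketch of what that looks like in code, the service account can be attached to the Dataflow job through Beam’s pipeline options. Every value below (project ID, region, bucket, job name, and service account email) is a placeholder; creating the account and granting its IAM roles is handled separately in Task 4.

```python
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    StandardOptions,
)

options = PipelineOptions()

# Run the pipeline on Dataflow instead of locally.
options.view_as(StandardOptions).runner = "DataflowRunner"

gcp = options.view_as(GoogleCloudOptions)
gcp.project = "my-project"                        # placeholder project ID
gcp.region = "us-central1"                        # placeholder region
gcp.temp_location = "gs://my-airbnb-bucket/tmp"   # placeholder temp bucket
gcp.job_name = "airbnb-etl"                       # placeholder job name
# Dataflow workers run under this service account, not a personal user.
gcp.service_account_email = (
    "etl-pipeline@my-project.iam.gserviceaccount.com"  # placeholder account
)
```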
While the pipeline runs, we’ll monitor it using Dataflow’s monitoring tools. Once it finishes, we’ll carry out validation and testing steps to verify the pipeline’s output in BigQuery. This project focuses on the fundamental ETL process of extracting data from one source and loading it into another, demonstrating how raw data can be made ready for analysis in Google Cloud.
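One possible post-run sanity check is sketched below using the google-cloud-bigquery client library and the placeholder names used earlier; the same query can equally be run from the BigQuery console.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Confirm the load produced rows and that prices fall in a plausible range.
query = """
    SELECT COUNT(*) AS row_count,
           MIN(price) AS min_price,
           MAX(price) AS max_price
    FROM `my-project.airbnb_dataset.listings`
"""
for row in client.query(query).result():
    print(row.row_count, row.min_price, row.max_price)
```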
Project Tasks
1. Introduction
Task 0: Get Started
Task 1: Set Up and Configure GCP Project
2. Upload Data to Google Cloud Storage and Create Service Account (a client-library sketch follows this group)
Task 2: Create Cloud Storage Bucket
Task 3: Upload Data to Storage Bucket
Task 4: Create Service Account and Assign Roles
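The sketch below illustrates one way to perform the upload in Task 3 with the google-cloud-storage Python client; the project may instead use the console or gsutil/gcloud CLI commands, and the project ID, bucket name, and file paths are placeholders.

```python
from google.cloud import storage

client = storage.Client(project="my-project")   # placeholder project ID
bucket = client.bucket("my-airbnb-bucket")      # placeholder bucket name

# Upload the local raw CSV into the bucket as listings.csv.
blob = bucket.blob("listings.csv")
blob.upload_from_filename("data/listings.csv")  # placeholder local path
```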
3. Build the ETL Pipeline (an aggregation sketch follows this group)
Task 5: Get Started with Apache Beam
Task 6: Create a Pipeline Object with Options
Task 7: Read Data from GCS Bucket in the Pipeline
Task 8: Parse the Input Data
Task 9: Apply Data Validation Steps in the Pipeline
Task 10: Apply Data Cleaning Steps in the Pipeline
Task 11: Enrich the Data
Task 12: Perform Aggregation Operations
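As one hedged example of the enrichment and aggregation work in Tasks 11 and 12, the sketch below computes an average listing price per neighbourhood; the field names and sample records are placeholders for the real Airbnb columns.

```python
import apache_beam as beam

# Placeholder records standing in for the cleaned Airbnb rows.
cleaned = [
    {"neighbourhood": "Downtown", "price": 120.0},
    {"neighbourhood": "Downtown", "price": 80.0},
    {"neighbourhood": "Harbour", "price": 200.0},
]

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(cleaned)
        # Key each record by neighbourhood ...
        | "Key by area" >> beam.Map(lambda r: (r["neighbourhood"], r["price"]))
        # ... and compute the average listing price per neighbourhood.
        | "Average price" >> beam.combiners.Mean.PerKey()
        | beam.Map(print)  # e.g. ('Downtown', 100.0)
    )
```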
4. Write Data to BigQuery (a dataset-and-view sketch follows this group)
Task 13: Create BigQuery Dataset
Task 14: Upload Final Data to BigQuery Table
Task 15: Execute the Pipeline
Task 16: Create Views
Task 17: Validate the Results
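For Tasks 13 and 16, the sketch below shows one way to create the destination dataset and an analytical view with the google-cloud-bigquery client; the dataset, table, and view names are placeholders, and the project may instead use bq CLI commands or the console.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Task 13: create the dataset the pipeline writes into.
dataset = bigquery.Dataset("my-project.airbnb_dataset")  # placeholder IDs
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)

# Task 16: expose an analytical view on top of the loaded table.
view = bigquery.Table("my-project.airbnb_dataset.avg_price_by_area")
view.view_query = """
    SELECT neighbourhood, AVG(price) AS avg_price
    FROM `my-project.airbnb_dataset.listings`
    GROUP BY neighbourhood
"""
client.create_table(view, exists_ok=True)
```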
5. Run Complete Python Code Using Dataflow Runner
Task 18: Complete Python Code with Dataflow Runner
Task 19: Monitor the Pipeline Execution
Task 20: Validate the Pipeline Results
Task 21: Clean Up
Congratulations!
Relevant Courses
Use the following content to review prerequisites or explore specific concepts in detail.