Apache Spark is a multi-language engine for executing data engineering and data science jobs on single-node machines or clusters. It is known for fast processing of large datasets and for distributing work across many machines, running anywhere from a personal laptop to clusters orchestrated by platforms such as Docker Swarm and Kubernetes.
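
The same application code runs unchanged whether Spark executes locally or on a cluster. Below is a minimal PySpark sketch, assuming the pyspark package is installed; the application name and sample data are illustrative.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; this is the entry point for a Spark application.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Distribute a small in-memory dataset and count word occurrences in parallel.
words = spark.sparkContext.parallelize(["spark", "engine", "spark", "cluster"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

print(counts.collect())  # e.g. [('spark', 2), ('engine', 1), ('cluster', 1)]
spark.stop()
```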

Apache Spark architecture

Apache Spark architecture has a few key components:

  • Driver program: It runs the application's user code and creates a SparkContext (a short sketch follows this list). The driver also contains the DAG scheduler, task scheduler, backend scheduler, and block manager, which together translate user code into tasks that execute on the cluster.

  • Cluster manager: It allocates resources to the application and manages job execution on the cluster based on the resources available.

  • Worker node: It runs the tasks assigned by the driver and may cache data locally to speed up later stages. When a task finishes, the worker node returns the result to the SparkContext.
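
The sketch below, assuming PySpark is installed, shows how these pieces connect: the driver creates the SparkContext, and the master URL decides whether tasks run in local threads or are handed to a cluster manager. The cluster address spark://master-host:7077 is a hypothetical example.

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("ArchitectureDemo")
    # "local[2]" runs tasks in 2 local threads; swapping in a cluster URL
    # such as "spark://master-host:7077" hands scheduling to that cluster manager.
    .setMaster("local[2]")
)

# The driver program creates the SparkContext.
sc = SparkContext(conf=conf)

# The driver's schedulers split this job into tasks; worker nodes (executors)
# run them and return their results to the SparkContext via collect().
result = sc.parallelize(range(10)).map(lambda x: x * x).collect()
print(result)

sc.stop()
```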
