Ingest with PySpark
Learn how to ingest data into the data warehouse using PySpark.
Apache Spark is a multi-language engine for executing data engineering and data science jobs on single-node machines or clusters. It is known for processing large datasets and for distributing work across many machines, whether those are personal laptops or clusters managed by orchestrators such as Docker Swarm and Kubernetes.
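Before looking at the architecture, here is a minimal PySpark ingestion sketch: it reads a raw file into a DataFrame and writes it to the warehouse. The file path and table name are hypothetical placeholders, and the exact write target would depend on how your warehouse is configured.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession -- the entry point for a PySpark application.
spark = (
    SparkSession.builder
    .appName("ingest-example")
    .getOrCreate()
)

# Read a raw CSV file into a DataFrame; the path is a placeholder.
raw_df = (
    spark.read
    .option("header", True)       # first row contains column names
    .option("inferSchema", True)  # let Spark guess column types
    .csv("data/raw/orders.csv")
)

# Persist the DataFrame as a managed table in the warehouse.
# The table name is illustrative; in practice it matches your warehouse schema.
raw_df.write.mode("overwrite").saveAsTable("warehouse.orders")

spark.stop()
```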
Apache Spark architecture
Apache Spark architecture has a few key components:
Driver program: It runs the user's code and creates a SparkContext. The driver program also contains a DAG scheduler, task scheduler, backend scheduler, and block manager. Together, they are responsible for translating user code into tasks that can be executed on the cluster.
Cluster manager: It allocates cluster resources to the application and coordinates job execution across the available worker nodes.
Worker node: It processes the tasks. Sometimes, data is cached on the worker node. When a task is finished, the worker node returns the result to the SparkContext in the driver program. The sketch after this list shows how these components come into play when a PySpark application starts.
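The following sketch illustrates where these components appear in code, assuming a hypothetical setup: the master URL tells the driver which cluster manager to connect to ("local[*]" simply runs everything on one machine), and the executor memory setting is an example of a resource request handled by the cluster manager.

```python
from pyspark.sql import SparkSession

# The driver program creates the SparkSession (and its underlying SparkContext).
# The master URL selects the cluster manager; the values below are illustrative.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[*]")                      # or e.g. "spark://host:7077", "yarn"
    .config("spark.executor.memory", "2g")   # resources requested from the cluster manager
    .getOrCreate()
)

# A transformation plus an action: the driver's schedulers split the work into
# tasks, the cluster manager assigns them to worker nodes, and the results
# flow back to the SparkContext in the driver.
result = spark.range(1_000_000).selectExpr("sum(id)").collect()
print(result)

spark.stop()
```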