Architecture of Apache Spark

Understand the architecture of Apache Spark.

In this chapter, we will discuss the architecture of Apache Spark. Spark is an example of a distributed system: a set of machines that work together to achieve a common goal.

The Spark architecture consists of multiple components that work together, each running as one or more processes across the nodes of a cluster. In most cases, Spark is deployed on a cluster of many nodes for big data processing. We will discuss these components and show how they interact with each other.

High-level Apache Spark architecture

To understand the architecture, we first need to know a few Spark-specific concepts.

Resilient Distributed Dataset (RDD)

An RDD is Spark's abstraction for the data being processed. When we feed data into a Spark program, Spark reads the data and creates an RDD. Under the hood, Spark processes the data and produces results through RDDs.
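To make the abstraction concrete, here is a tiny, purely illustrative Python sketch (not the real Spark API; `TinyRDD` and its methods are invented for this example). It models the core RDD idea: immutable data split into partitions, plus a recorded transformation (lineage) that lets any lost partition be recomputed independently.

```python
# Illustrative sketch only -- NOT the Spark API.
# Models an RDD as partitioned data plus a recorded transformation.

class TinyRDD:
    def __init__(self, partitions, transform=None):
        self.partitions = partitions   # list of lists: the "distributed" data
        self.transform = transform     # lineage: how each element is derived

    def map(self, fn):
        # Record the transformation instead of eagerly rewriting the data.
        return TinyRDD(self.partitions, fn)

    def compute_partition(self, i):
        # Each partition can be recomputed on its own -- this is what makes
        # recovery after a node failure cheap ("resilient").
        part = self.partitions[i]
        return [self.transform(x) for x in part] if self.transform else list(part)

    def collect(self):
        # Gather the results of all partitions into one list.
        return [x for i in range(len(self.partitions))
                for x in self.compute_partition(i)]

data = TinyRDD([[1, 2], [3, 4]])        # two partitions
doubled = data.map(lambda x: x * 2)     # lazily recorded, not yet computed
print(doubled.collect())                # [2, 4, 6, 8]
# If the result of partition 1 is lost, recompute just that partition:
print(doubled.compute_partition(1))     # [6, 8]
```

In real Spark, the equivalent program would create the RDD with `SparkContext.parallelize` and apply `map` and `collect` on it; the sketch only mirrors the shape of that workflow.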

So what is an RDD? Let’s go over each part of the name.

  • Resilient: Fault-tolerant. If there is a failure, the data ...