Apache Spark is an open-source, general-purpose, distributed cluster-computing framework and one of the largest open-source projects in data processing. Spark offers excellent performance and comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing.
Let's see how Spark does its job by looking at its architecture.
Before diving into the actual architecture of Spark, it is important to note the two most significant abstractions that Spark uses for data management.
A Resilient Distributed Dataset (RDD) is a collection of records that the executors use for computation. The records stored in an RDD can be objects from different languages, including Python, Scala, or Java. These datasets are immutable and partitioned across the nodes of the cluster, so any lost partition can be recomputed from its lineage.
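A minimal sketch of these properties, assuming a spark-shell session where `sc` is the pre-created SparkContext:

```scala
// Turn a local Scala collection into a distributed RDD.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// RDDs are immutable: map() leaves `numbers` untouched and returns a
// new RDD that only remembers how it was derived (its lineage).
val doubled = numbers.map(_ * 2)

// collect() is an action: it computes the records and returns them to the driver.
println(doubled.collect().mkString(", "))   // prints: 2, 4, 6, 8, 10
```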
A Directed Acyclic Graph (DAG) lets Spark plan the sequence of tasks to execute. A DAG is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs. In a Spark DAG, every edge points from an earlier step to a later one in the sequence, so the graph never cycles back on itself.
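As an illustration of how the graph is built lazily (again in a spark-shell session where `sc` is available; `data.txt` is a hypothetical input file):

```scala
// Transformations only add vertices and edges to the plan; nothing runs yet.
val words  = sc.textFile("data.txt").flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

// toDebugString prints the lineage Spark has recorded so far,
// which is the RDD-level view of the DAG.
println(counts.toDebugString)

// Only an action makes the DAGScheduler turn the graph into stages
// and tasks and ship them to the executors.
counts.collect()
```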
Spark uses the master/slave architecture, which depends on two daemons:

- the Master Daemon, which hosts the master/driver process, and
- the Worker Daemon, which hosts the worker/slave processes.

A Cluster Manager (such as Spark's standalone manager, YARN, or Kubernetes) binds these daemons together.
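For illustration only, the Cluster Manager an application talks to, and the executor resources it requests, are set on the SparkConf inside the driver program; the master URL, host names, and resource sizes below are placeholders:

```scala
import org.apache.spark.SparkConf

// The master URL selects the cluster manager:
//   "local[*]"               - no cluster manager, run everything in one JVM
//   "spark://host:7077"      - Spark's standalone cluster manager
//   "yarn"                   - Hadoop YARN
//   "k8s://https://host:443" - Kubernetes
val conf = new SparkConf()
  .setAppName("ArchitectureDemo")
  .setMaster("spark://master-host:7077")   // placeholder standalone master
  .set("spark.executor.memory", "2g")      // memory to request per executor
  .set("spark.executor.cores", "2")        // cores to request per executor

// A driver program would now pass this configuration to a SparkContext,
// as shown in the sketch further below.
```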
The Master Node is the management hub of Spark. It runs the application's main() function through the SparkContext. The SparkContext is your entry point to all of Spark's functionality. The Driver Program contains components such as the DAGScheduler, TaskScheduler, BackendScheduler, and BlockManager. The Driver Program communicates with the Cluster Manager and schedules tasks onto the different processes: each job is split into multiple tasks, which are then distributed over the Worker Nodes. Any RDD created in the SparkContext can be distributed across various nodes and cached there.
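A minimal sketch of a complete driver program that ties these pieces together; the application name, master URL, and data are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext in main() makes this process the driver.
    val conf = new SparkConf()
      .setAppName("WordCountDriver")
      .setMaster("local[*]")   // placeholder; a real cluster would use a cluster-manager URL
    val sc = new SparkContext(conf)

    // Transformations only record the lineage on the driver side.
    val lines  = sc.parallelize(Seq("spark is fast", "spark is simple"))
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // The action submits a job: the DAGScheduler splits it into stages,
    // the TaskScheduler sends one task per partition to the executors,
    // and the results come back to the SparkContext in the driver.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```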
Worker Nodes handle the execution of the tasks assigned to them by the Master Node. There is only one Master Node, but there can be many Worker Nodes. The Executors on each Worker Node carry out all the computation on the RDD partitions, and the results are returned to the SparkContext.
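To see the partition-per-task relationship, here is a small sketch (a spark-shell session is assumed, with `sc` already created; the partition count is arbitrary):

```scala
// Four partitions mean four tasks, scheduled onto whichever executors
// the Worker Nodes provide.
val data = sc.parallelize(1 to 8, numSlices = 4)

// Each executor works only on the partitions it was given...
val perPartition = data.mapPartitionsWithIndex { (id, nums) =>
  Iterator((id, nums.sum))
}

// ...and the action returns the per-partition results to the SparkContext
// running inside the driver.
perPartition.collect().foreach { case (id, sum) =>
  println(s"partition $id -> $sum")
}
```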
It is important to note that you can increase the number of workers, which lets you divide a job's data into more partitions and execute them in parallel over multiple systems, as the sketch below shows. Increasing the number of workers also increases the total memory available, which lets you cache more RDDs in memory and execute jobs faster.
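A sketch of repartitioning and caching (spark-shell assumed; `events.log` and the partition count are placeholders):

```scala
// "events.log" is a hypothetical input file.
val logs = sc.textFile("events.log")

// More partitions -> more tasks that can run in parallel once
// additional workers (and their executors) are available.
val spread = logs.repartition(16)

// cache() keeps the computed partitions in executor memory,
// so later jobs over the same RDD skip the recomputation.
spread.cache()

println(spread.count())   // first action: computes and caches the partitions
println(spread.count())   // second action: served from the cache
```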