High-level Design of Spark

Get introduced to the primary building blocks and programming model of Spark.

Building blocks

The building blocks of Spark include resilient distributed datasets (RDDs), the driver, and worker nodes. This lesson briefly describes each of these components.

Resilient distributed datasets (RDDs)

  • An RDD is an abstraction: a read-only collection of objects, partitioned across a cluster of machines, that can be rebuilt if part of it is lost.

  • RDDs can be created in two ways: by applying a transformation to an existing RDD or by reading data from a distributed file system.

    • Every RDD is divided into partitions of data.

    • Those partitions are stored across the machines of the cluster.

  • For example, let's say an RDD is initially created from a file, then a subsequent RDD is created from that RDD, and so on.

  • Spark keeps a graph, called the lineage graph, that records the source of every RDD.

  • RDDs implement an interface that keeps the following details:

    • A list of partition objects, each of which contains its own set of data

    • An iterator that traverses the data in a partition

    • A list of worker nodes ...
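
To make the ideas of partitions, per-partition iterators, and lineage more concrete, here is a minimal single-process sketch in plain Python. It is a toy model and not Spark's API: the names ToyRDD, Partition, from_lines, and lineage are hypothetical, the "cluster" is just a tuple of in-memory partitions, and transformations run eagerly, whereas Spark defers them until a result is requested.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Sequence


@dataclass(frozen=True)
class Partition:
    """One slice of the dataset (hypothetical stand-in for a partition object)."""
    index: int
    data: tuple  # a tuple keeps the slice read-only


class ToyRDD:
    """Toy model of an RDD: immutable partitions plus a link to its parent."""

    def __init__(self, partitions: Sequence[Partition],
                 parent: Optional["ToyRDD"] = None, op: str = "source"):
        self._partitions = tuple(partitions)
        self.parent = parent  # the RDD this one was derived from, if any
        self.op = op          # the operation that produced this RDD

    @classmethod
    def from_lines(cls, lines: Sequence, num_partitions: int) -> "ToyRDD":
        # Stand-in for reading a file from a distributed file system:
        # split the input into contiguous chunks, one per partition.
        size = -(-len(lines) // num_partitions)  # ceiling division
        parts = [Partition(i, tuple(lines[i * size:(i + 1) * size]))
                 for i in range(num_partitions)]
        return cls(parts, op="from_lines")

    def partitions(self) -> List[Partition]:
        return list(self._partitions)

    def iterator(self, partition: Partition) -> Iterator:
        # Traverses the data held in a single partition.
        return iter(partition.data)

    def map(self, fn: Callable) -> "ToyRDD":
        # A transformation never mutates this RDD; it returns a new one.
        parts = [Partition(p.index, tuple(fn(x) for x in self.iterator(p)))
                 for p in self._partitions]
        return ToyRDD(parts, parent=self, op="map")

    def filter(self, pred: Callable) -> "ToyRDD":
        parts = [Partition(p.index, tuple(x for x in self.iterator(p) if pred(x)))
                 for p in self._partitions]
        return ToyRDD(parts, parent=self, op="filter")

    def lineage(self) -> List[str]:
        # Walk the parent links back to the source, then report oldest first.
        chain, node = [], self
        while node is not None:
            chain.append(node.op)
            node = node.parent
        return list(reversed(chain))

    def collect(self) -> list:
        return [x for p in self._partitions for x in self.iterator(p)]


base = ToyRDD.from_lines(["a", "bb", "ccc", "dddd"], num_partitions=2)
lengths = base.map(len)                       # RDD derived from an RDD
long_ones = lengths.filter(lambda n: n > 2)   # and so on

print(long_ones.collect())   # [3, 4]
print(long_ones.lineage())   # ['from_lines', 'map', 'filter']
```

Because each derived ToyRDD records only its parent and the operation that produced it, a lost partition could in principle be recomputed by replaying that chain from the source, which is the role the lineage graph plays in the description above.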
