Resilient Distributed Datasets (RDDs)

This lesson introduces Resilient Distributed Datasets (RDDs), the fundamental data-structure abstraction in Spark.

We'll cover the following...

Resilient Distributed Datasets (RDDs)
Creating RDDs from local collections
Creating RDDs from data sources
Creating RDDs from DataFrames & Datasets

Resilient Distributed Datasets (RDDs)

The fundamental abstraction in Spark is the RDD, short for Resilient Distributed Dataset. An RDD is a read-only (immutable) collection of objects or records that is partitioned across the cluster and can be operated on in parallel. If the node hosting a partition fails, that partition can be reconstructed. RDDs are a lower-level API: the other two Spark data abstractions, DataFrames and Datasets, compile down to RDDs. The records within an RDD are Java, Python, or Scala objects, and these objects can hold any data in any format.
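To make these properties concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the class ToyRDD and its methods (from_collection, map, partition, lose_partition, collect) are hypothetical names invented for illustration, and threads on one machine stand in for cluster nodes. It shows immutable partitions, partitions being processed independently, and a lost partition being rebuilt from the recipe that produced it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, num_partitions: int, compute: Callable[[int], tuple]):
        self._num_partitions = num_partitions
        # The "recipe" for partition i; kept so a lost partition can be rebuilt.
        self._compute = compute
        # Materialized partitions; None means "not computed yet, or lost".
        self._cache: List[Optional[tuple]] = [None] * num_partitions

    @classmethod
    def from_collection(cls, data: Sequence, num_partitions: int) -> "ToyRDD":
        source = tuple(data)  # immutable snapshot of the input
        size = (len(source) + num_partitions - 1) // num_partitions  # ceiling division
        # Partition i is the i-th contiguous chunk of the source.
        return cls(num_partitions, lambda i: source[i * size:(i + 1) * size])

    def map(self, fn: Callable) -> "ToyRDD":
        # Returns a NEW ToyRDD; the parent is never modified.
        parent = self
        return ToyRDD(
            self._num_partitions,
            lambda i: tuple(fn(x) for x in parent.partition(i)),
        )

    def partition(self, i: int) -> tuple:
        # Compute the partition on first use, or again after it was lost.
        if self._cache[i] is None:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulates the failure of the node that was holding partition i.
        self._cache[i] = None

    def collect(self) -> list:
        # Partitions are independent, so each one can be handled by its own worker.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(self.partition, range(self._num_partitions)))
        return [record for part in parts for record in part]


numbers = ToyRDD.from_collection(range(10), num_partitions=3)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)   # pretend the node holding partition 1 failed
after = squares.collect()   # partition 1 is recomputed from its recipe
```

In this sketch, before and after hold the same records: the dropped partition is rebuilt on demand from its parent, and numbers itself is never changed by the map call.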

Because RDDs are a low-level API, the Spark authors discourage working with them directly unless you need fine-grained control. Using RDDs means giving up the optimizations and pre-built functionality that come with the structured APIs, DataFrames and Datasets. For instance, the structured APIs store data compressed in an optimized binary format; with RDDs, you must achieve that manually.

As noted, RDD stands for Resilient Distributed Dataset. Let’s examine each property in turn:

Resilient: An RDD is fault-tolerant and can recompute partitions that are missing or damaged due to node failures. This self-healing is ...
