Resilient Distributed Datasets
Learn about the Resilient Distributed Datasets (RDDs) that form the building blocks for storing and processing data in Spark.
RDDs
The fundamental abstraction in Spark is the Resilient Distributed Dataset (RDD): a read-only (immutable) collection of objects or records, partitioned across the cluster, that can be operated on in parallel. If the node hosting a partition fails, the partition can be reconstructed from the lineage of operations that produced it. RDDs are a lower-level API; DataFrames and Datasets compile down to RDDs. The constituent records or objects within an RDD are Java, Python, or Scala objects, and anything can be stored in these objects, in any format.
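To make the ideas of partitioning, immutability, and collecting results concrete, here is a minimal sketch in plain Python (no Spark runtime needed). The class name `TinyRDD` and its methods are hypothetical illustrations, not Spark's actual implementation; real RDDs distribute partitions across cluster nodes and track lineage for fault recovery.

```python
# Conceptual sketch of an RDD-like structure: records are split across
# partitions, and transformations produce a new, immutable collection.
# TinyRDD is a made-up name for illustration only.

class TinyRDD:
    def __init__(self, records, num_partitions=3):
        # Split records across partitions; in real Spark each partition
        # could live on a different node in the cluster.
        self.partitions = [records[i::num_partitions]
                           for i in range(num_partitions)]

    def map(self, fn):
        # Immutability: map returns a NEW TinyRDD; the original is untouched.
        result = TinyRDD([], len(self.partitions))
        result.partitions = [[fn(r) for r in part]
                             for part in self.partitions]
        return result

    def collect(self):
        # Gather every partition back into a single local list,
        # analogous to RDD.collect() in Spark.
        return [r for part in self.partitions for r in part]

rdd = TinyRDD([1, 2, 3, 4, 5, 6])
doubled = rdd.map(lambda x: x * 2)
print(sorted(doubled.collect()))  # [2, 4, 6, 8, 10, 12]
print(sorted(rdd.collect()))      # original unchanged: [1, 2, 3, 4, 5, 6]
```

Note how each partition can be processed independently, which is what lets Spark run the `map` in parallel across the cluster.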
Because RDDs are a low-level API, the Spark authors discourage working with them directly unless we intend to exercise fine-grained control. In using RDDs, we sacrifice the optimizations and ...