Resilient Distributed Datasets (RDDs)

This lesson introduces Resilient Distributed Datasets (RDDs), the fundamental data-structure abstraction in Spark.

The fundamental abstraction in Spark is the RDD, short for Resilient Distributed Dataset. It is a read-only (immutable) collection of objects or records, partitioned across the cluster, that can be operated on in parallel. If the node hosting a partition fails, that partition can be reconstructed. RDDs are a lower-level API; the other two Spark data abstractions, DataFrames and Datasets, compile to RDDs. The constituent records of an RDD are Java, Python, or Scala objects, and those objects can hold data in any format.
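The properties above (immutability, partitioning, and reconstruction of a lost partition) can be illustrated with a small plain-Python sketch. This is a toy model, not Spark's API: ToyRDD, its parallelize, map, and collect methods, and the cached dictionary are hypothetical names chosen for illustration, and the way records are split into partitions is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ToyRDD:
    """Hypothetical stand-in for an RDD; NOT the real Spark class."""
    num_partitions: int
    # compute(i) rebuilds the records of partition i from scratch.
    compute: Callable[[int], Tuple]

    @staticmethod
    def parallelize(data, num_partitions):
        records = tuple(data)

        def compute(i):
            # Assumed split: contiguous, roughly equal slices.
            start = i * len(records) // num_partitions
            end = (i + 1) * len(records) // num_partitions
            return records[start:end]

        return ToyRDD(num_partitions, compute)

    def map(self, fn):
        # A transformation returns a NEW ToyRDD that remembers its parent.
        parent = self
        return ToyRDD(
            self.num_partitions,
            lambda i: tuple(fn(x) for x in parent.compute(i)),
        )

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


numbers = ToyRDD.parallelize(range(1, 9), num_partitions=4)
squares = numbers.map(lambda x: x * x)  # numbers itself is left unchanged

# Pretend each partition's results live on a different node.
cached = {i: squares.compute(i) for i in range(squares.num_partitions)}

del cached[2]                   # simulate losing the node that held partition 2
cached[2] = squares.compute(2)  # rebuild only that partition from its parent

print(numbers.collect())
print(squares.collect())
print(cached[2])
```

In this sketch, each derived collection stores how to recompute a partition from its parent rather than a copy of the data, so rebuilding the lost partition does not touch the other three.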

RDDs are a low-level API, and the Spark authors discourage working with them directly unless the intent is to exercise fine-grained control. Using RDDs means giving up the optimizations and pre-built functionality that come with structured APIs such as DataFrames and Datasets. For instance, data is compressed and stored in an optimized binary format in case of structured ...
