Resilient Distributed Datasets
Learn about the Resilient Distributed Datasets (RDDs) that form the building blocks for storing and processing data in Spark.
RDDs
The fundamental abstraction in Spark is the Resilient Distributed Dataset (RDD): a read-only (immutable) collection of objects or records, partitioned across the cluster, that can be operated on in parallel. If the node hosting a partition fails, the partition can be reconstructed from the lineage of operations that produced it. RDDs are a lower-level API; DataFrames and Datasets compile down to RDDs. The constituent records or objects within an RDD are Java, Python, or Scala objects, and anything can be stored in these objects, in any format.
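To make the ideas of partitioning, immutability, and collecting results concrete, here is a minimal sketch in plain Python (no Spark runtime needed). The class name `TinyRDD` and its methods are hypothetical illustrations, not Spark's actual implementation; real RDDs distribute partitions across cluster nodes and track lineage for fault recovery.

```python
# Conceptual sketch of an RDD-like structure: records are split across
# partitions, and transformations produce a new, immutable collection.
# TinyRDD is a made-up name for illustration only.

class TinyRDD:
    def __init__(self, records, num_partitions=3):
        # Split records across partitions; in real Spark each partition
        # could live on a different node in the cluster.
        self.partitions = [records[i::num_partitions]
                           for i in range(num_partitions)]

    def map(self, fn):
        # Immutability: map returns a NEW TinyRDD; the original is untouched.
        result = TinyRDD([], len(self.partitions))
        result.partitions = [[fn(r) for r in part]
                             for part in self.partitions]
        return result

    def collect(self):
        # Gather every partition back into a single local list,
        # analogous to RDD.collect() in Spark.
        return [r for part in self.partitions for r in part]

rdd = TinyRDD([1, 2, 3, 4, 5, 6])
doubled = rdd.map(lambda x: x * 2)
print(sorted(doubled.collect()))  # [2, 4, 6, 8, 10, 12]
print(sorted(rdd.collect()))      # original unchanged: [1, 2, 3, 4, 5, 6]
```

Note how each partition can be processed independently, which is what lets Spark run the `map` in parallel across the cluster.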
Because RDDs are a low-level API, the Spark authors discourage working with them directly unless we intend to exercise fine-grained control. In using RDDs, we sacrifice the optimizations and ...