Resilient Distributed Datasets

Learn about the Resilient Distributed Datasets (RDDs) that form the building blocks for storing and processing data in Spark.

RDDs

The fundamental abstraction in Spark is the Resilient Distributed Dataset (RDD). It is a read-only (immutable) collection of objects or records, partitioned across the cluster, that can be operated on in parallel. A partition can be reconstructed if the node hosting it experiences a failure. RDDs are a lower-level API; DataFrames and Datasets compile down to RDDs. The constituent records or objects within an RDD are Java, Python, or Scala objects, and anything can be stored in them in any format.

RDDs are a low-level API, so the Spark authors discourage working with them directly unless we need fine-grained control. In using RDDs, we sacrifice the optimizations and pre-built functionality that come with the structured APIs such as DataFrames and Datasets. For example, the structured APIs compress and store data in an optimized binary format, something we would have to implement manually when working with RDDs.
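To make that trade-off concrete, here is a minimal sketch (assuming a spark-shell session; the variable names are illustrative) that expresses the same filter through the structured API, where Spark optimizes the plan, and through the RDD API, where we operate on plain JVM objects ourselves:

// Structured API: Spark plans and optimizes the filter and stores data in its binary format
val df = spark.range(1000).toDF("id")
val evenDF = df.filter("id % 2 = 0")

// RDD API: we work directly with Row objects and forgo those optimizations
val evenRDD = df.rdd.filter(row => row.getLong(0) % 2 == 0)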

The following are the properties of RDDs:

  • Resilient: An RDD is fault-tolerant and is able to recompute missing or damaged partitions due to node failures. This self-healing is made possible by an RDD lineage graph that we'll cover in more depth later (a quick peek follows this list). Essentially, an RDD remembers how it arrived at its current state and can retrace those steps to recompute any lost partitions.

  • Distributed: Data making up an RDD is spread across a cluster of machines.

  • Datasets: A representation of the data records we work with. External data can be loaded from a variety of sources such as a JSON file, CSV file, text file, or a database via JDBC.
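As a quick peek at lineage (a minimal sketch that uses the parallelize method introduced in the next section), an RDD's chain of transformations can be printed with toDebugString:

// Build a small RDD through a couple of transformations
val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2).filter(_ > 5)

// Print the lineage Spark would replay to recompute lost partitions
println(doubled.toDebugString)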

Spark RDDs can be cached and manually partitioned. Caching is useful when an RDD is reused across multiple actions, since it avoids recomputing it each time. Manual partitioning helps balance partitions across the cluster. Both are sketched below; after that, we'll jump right into code and see a few examples of creating RDDs.
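Here is a rough sketch (the variable names and partition count are illustrative) of what caching and repartitioning look like:

// Keep the RDD in memory so repeated actions don't recompute it
val cached = sc.parallelize(1 to 1000).cache()

// The cache is populated the first time an action runs
cached.count

// Inspect and change the number of partitions
cached.getNumPartitions
val rebalanced = cached.repartition(4)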

Creating RDDs from local collections

The simplest way to create an RDD is from an existing collection using the parallelize(…) method exposed by the SparkContext. This method accepts a collection and copies the elements to form a distributed data set that can be operated on in parallel. Below is a sample snippet of code to create an RDD from a collection of strings.

// Create a local list
val brands = List("Tesla", "Ford", "GM")

// Create a distributed dataset from the local list as an RDD
val brandsRDD = sc.parallelize(brands)

Note that the parallelize(...) method also accepts a second parameter that lets the caller specify the number of partitions, as shown below:
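For instance, continuing with the brands list from above (the partition count of 3 here is arbitrary):

// Create the RDD with an explicit number of partitions
val partitionedBrandsRDD = sc.parallelize(brands, 3)

// Verify the partition count
partitionedBrandsRDD.getNumPartitions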


Creating RDDs from data sources

We can also create RDDs from data sources such as text files. However, it is usually better to read data sources in as DataFrames or Datasets; we'll cover that later. To read in a file as an RDD, we'll need to use the SparkContext as shown below:

// Create an RDD by reading a file from the local filesystem
val data = sc.textFile("/data/cars.data")

// Print the count of the number of records in the RDD in spark-shell console
data.count

Each line in the text file is stored as a record in the RDD. We can also read the entire file as a single record using the method wholeTextFiles(...), which returns pairs of file path and file content.

// Read the text file as a single (path, content) record
val data = sc.wholeTextFiles("/data/cars.data")

// Inspect the number of records in the RDD, which should now be one
data.count

In this instance, we read a file from the local filesystem, but Spark offers a much wider array of data sources to choose from, including HDFS, HBase, Cassandra, Amazon S3, and more. Spark supports text files, sequence files, and any other Hadoop InputFormat.
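As a sketch, the same SparkContext methods work against these sources. The host names, bucket name, and file paths below are placeholders, and reading from S3 assumes the appropriate connector is configured:

// Hypothetical HDFS and S3 paths; the API is the same as for local files
val hdfsData = sc.textFile("hdfs://namenode:8020/data/cars.data")
val s3Data = sc.textFile("s3a://my-bucket/data/cars.data")

// Sequence files are read with explicit key and value types
import org.apache.hadoop.io.Text
val seqData = sc.sequenceFile("/data/cars.seq", classOf[Text], classOf[Text])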

Creating RDDs from DataFrames & Datasets

One of the easiest ways to create an RDD is from an existing DataFrame or Dataset. We haven't covered DataFrames or Datasets yet, but we'll share an example nonetheless.

// Create a DataFrame
val dataFrame = spark.range(100).toDF()

// Create an RDD from the DataFrame
val rdd = dataFrame.rdd

// Inspect the number of records in the RDD
rdd.count
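The RDD obtained this way contains Row objects rather than plain values. As a small sketch (the variable name ids is illustrative; spark.range produces a single long column at index 0), the underlying values can be pulled out with a map:

// Extract the long values from the Row objects
val ids = rdd.map(row => row.getLong(0))

// Look at the first few values
ids.take(5)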

To try out the examples above, execute the command spark-shell to fire up the Spark console and paste in the snippets.
