Resilient Distributed Datasets (RDDs)
Learn the basics of RDDs and their practical use.
What is an RDD?
An RDD (Resilient Distributed Dataset) is the core data structure of PySpark: a fundamental, low-level building block. The name captures three important properties:
- Resilient: Ability to withstand failures
- Distributed: Spanning across multiple machines
- Datasets: Collection of partitioned data, e.g., arrays, tables, tuples, etc.
An RDD is a fault-tolerant, immutable, distributed collection of elements that can be operated on in parallel. Once created, it can’t be changed, which is why it is immutable. Each RDD is split into logical partitions that can be computed on different nodes of the cluster, which is why it is distributed. We can think of an RDD as a list in Python, except that an RDD is spread across multiple nodes in the cluster: an RDD can’t be modified after creation, while a list can’t be distributed and must be processed on a single machine.
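As a quick illustration of this immutability, here is a minimal sketch assuming a local SparkSession (the app name and sample data are illustrative):

```python
from pyspark.sql import SparkSession

# Entry point; "immutable-demo" is an illustrative app name
spark = SparkSession.builder.appName("immutable-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3])

# map() never changes rdd in place; it returns a brand-new RDD
doubled = rdd.map(lambda x: x * 2)

print(rdd.collect())      # [1, 2, 3] -- the original is untouched
print(doubled.collect())  # [2, 4, 6]
```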
RDD features
In addition to the main properties of fault tolerance, immutability, and distribution, RDDs have the following features:
- In-memory computation: RDDs can cache data in memory, allowing faster iterative computations by persisting intermediate results.
- Lazy evaluation: Transformations on RDDs are lazily evaluated, meaning computations are postponed until an action is triggered, optimizing execution plans (see the sketch after this list).
- Transformations and actions: RDDs support two types of operations: transformations and actions, which we’ll discuss in the next lesson.
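Here is a minimal sketch of lazy evaluation, again assuming a local SparkSession (the app name and sample data are illustrative):

```python
from pyspark.sql import SparkSession

# Entry point; "lazy-demo" is an illustrative app name
spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))

# map() is a transformation: Spark only records it in the execution plan;
# no computation happens on this line
squares = rdd.map(lambda x: x * x)

# reduce() is an action: only now does Spark run the recorded plan
total = squares.reduce(lambda a, b: a + b)
print(total)  # 285
```

Because nothing runs until the action is called, Spark can look at the whole chain of transformations and optimize it before executing anything.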
How to create an RDD
There are three main ways to create an RDD: parallelizing an existing collection, loading an existing dataset, and transforming an existing RDD.
Parallelizing an existing collection
In this method, we use the parallelize() function on an existing iterable or collection in our driver ...
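A minimal sketch of this method might look as follows, assuming a local SparkSession (the app name and sample list are illustrative):

```python
from pyspark.sql import SparkSession

# Entry point; "parallelize-demo" is an illustrative app name
spark = SparkSession.builder.appName("parallelize-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local Python list across the cluster as an RDD
numbers = [1, 2, 3, 4, 5]
rdd = sc.parallelize(numbers)

# collect() is an action that brings the distributed elements back to the driver
print(rdd.collect())           # [1, 2, 3, 4, 5]
print(rdd.getNumPartitions())  # how many logical partitions Spark created
```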