Resilient Distributed Datasets (RDDs)
Learn the basics of RDDs and their practical utilization.
What is an RDD?
An RDD (Resilient Distributed Dataset) is the core, low-level data structure of PySpark and one of its fundamental building blocks. The name RDD captures three important properties:
- Resilient: the ability to withstand failures
- Distributed: spanning multiple machines in a cluster
- Datasets: collections of partitioned data, e.g., arrays, tables, tuples, etc.
An RDD is a fault-tolerant, immutable, distributed collection of elements that can be operated on in parallel. Once created, it can't be changed, which is why it is immutable. The records in an RDD are split into logical partitions that can be computed on different nodes of the cluster, which is why it is distributed. We can think of an RDD as a list in Python, except that an RDD is spread across multiple nodes in the cluster: an RDD can't be modified once created, while a list can't be distributed and must be processed on a single machine.
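As a minimal sketch of these ideas, assuming a local PySpark installation, the snippet below builds an RDD from a Python list, splits it across partitions, and applies a transformation; the app name "rdd-basics" and the example data are purely illustrative.

```python
from pyspark.sql import SparkSession

# Start a local SparkSession; the RDD API is reached through its SparkContext.
spark = SparkSession.builder.master("local[*]").appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a Python list, split across 3 partitions (distributed).
numbers = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=3)

# Transformations return a *new* RDD; the original is never modified (immutable).
squares = numbers.map(lambda x: x * x)

print(squares.getNumPartitions())  # 3 -> the data lives in separate partitions
print(squares.collect())           # [1, 4, 9, 16, 25, 36]

spark.stop()
```

Each partition of `numbers` can be processed on a different node in parallel, while the plain Python list it was built from would have to be processed on a single machine.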