Resilient Distributed Datasets (RDDs)

What is an RDD?

An RDD (Resilient Distributed Dataset) is a fundamental building block of PySpark and is the core data structure. It is a low-level object in PySpark. The name RDD captures three important properties:

  • Resilient: Ability to withstand failures

  • Distributed: Spanning across multiple machines

  • Datasets: Collection of partitioned data, e.g., arrays, tables, tuples, etc.

RDD is a fault-tolerant, immutable, distributed collection of elements that can be operated on in parallel. Once created, we can’t change it, and that’s why it is immutable. Each record in RDD is a logical partition that can be computed on a different cluster and, therefore, distributed. We can think of RDD as a list in Python, except that RDD is distributed across multiple nodes in the cluster. So, RDD can’t be modified, while lists can’t be distributed and must be processed on a single CPU.

Get hands-on with 1400+ tech skills courses.