Resilient Distributed Datasets (RDDs)
Learn the basics of RDDs and their practical use.
What is an RDD?
An RDD (Resilient Distributed Dataset) is the core data structure of PySpark: a fundamental, low-level building block. The name captures three important properties:
- Resilient: Ability to withstand failures
- Distributed: Spanning across multiple machines
- Datasets: Collection of partitioned data, e.g., arrays, tables, tuples, etc.
An RDD is a fault-tolerant, immutable, distributed collection of elements that can be operated on in parallel. Once created, it can’t be changed, which is why it is immutable. Each RDD is split into logical partitions that can be computed on different nodes of the cluster, which is why it is distributed. We can think of an RDD as a list in Python, except that an RDD is spread across multiple nodes in the cluster: an RDD can’t be modified after creation, while a list can’t be distributed and must be processed on a single machine.
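As a quick illustration of this immutability, here is a minimal sketch assuming a local SparkSession (the app name and sample data are illustrative):

```python
from pyspark.sql import SparkSession

# Entry point; "immutable-demo" is an illustrative app name
spark = SparkSession.builder.appName("immutable-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3])

# map() never changes rdd in place; it returns a brand-new RDD
doubled = rdd.map(lambda x: x * 2)

print(rdd.collect())      # [1, 2, 3] -- the original is untouched
print(doubled.collect())  # [2, 4, 6]
```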
RDD features
In addition to the main properties of fault tolerance, immutability, and distribution, RDDs have the following features:
- In-memory computation: RDDs can cache data in memory, allowing faster iterative computations by persisting intermediate results.
- Lazy evaluation: Transformations on RDDs are lazily evaluated, meaning computations are postponed until an action is triggered, optimizing execution plans (see the sketch after this list).
- Transformations and actions: RDDs support two types of operations: transformations and actions, which we’ll discuss in the next lesson.
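Here is a minimal sketch of lazy evaluation, again assuming a local SparkSession (the app name and sample data are illustrative):

```python
from pyspark.sql import SparkSession

# Entry point; "lazy-demo" is an illustrative app name
spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))

# map() is a transformation: Spark only records it in the execution plan;
# no computation happens on this line
squares = rdd.map(lambda x: x * x)

# reduce() is an action: only now does Spark run the recorded plan
total = squares.reduce(lambda a, b: a + b)
print(total)  # 285
```

Because nothing runs until the action is called, Spark can look at the whole chain of transformations and optimize it before executing anything.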
How to create an RDD
There are three main ways to create an RDD: parallelizing an existing collection, loading an existing dataset, and transforming an existing RDD.
Parallelizing an existing collection
In this method, we use the parallelize() function on an existing iterable or collection in our driver ...
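A minimal sketch of this method might look as follows, assuming a local SparkSession (the app name and sample list are illustrative):

```python
from pyspark.sql import SparkSession

# Entry point; "parallelize-demo" is an illustrative app name
spark = SparkSession.builder.appName("parallelize-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local Python list across the cluster as an RDD
numbers = [1, 2, 3, 4, 5]
rdd = sc.parallelize(numbers)

# collect() is an action that brings the distributed elements back to the driver
print(rdd.collect())           # [1, 2, 3, 4, 5]
print(rdd.getNumPartitions())  # how many logical partitions Spark created
```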