RDD Operations

Learn the basics of RDD operations.

Introduction to RDD operations

There are two types of RDD operations:

  • Transformations: These are RDD operations that create a new dataset from an existing one. Transformations are lazy: Spark does not compute them until an action needs the result.

  • Actions: These are RDD operations that return a value to the driver program after running a computation on the dataset.
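The difference between the two can be illustrated without a Spark installation. The sketch below is a plain-Python analogy, not Spark code: Python's built-in map() stands in for a lazy transformation, and functools.reduce() stands in for an action that forces the computation and returns a value. The names square, calls, and calls_before_action are hypothetical helpers introduced only for this illustration.

```python
from functools import reduce

data = [1, 2, 3, 4, 5]
calls = []  # hypothetical log: records each time square() actually runs


def square(x):
    calls.append(x)
    return x ** 2


# "Transformation" stand-in: map() only builds a lazy iterator,
# so square() has not been called yet.
squared = map(square, data)
calls_before_action = len(calls)

# "Action" stand-in: reduce() consumes the iterator, which triggers
# the squaring, and returns a single value to the caller.
total = reduce(lambda x, y: x + y, squared)

print(calls_before_action)  # 0
print(len(calls))           # 5
print(total)                # 55
```

In Spark, the same pattern holds at cluster scale: transformations only describe the new dataset, and the work happens when an action asks for a result.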

Figure: RDD operations

Let’s understand RDD operations through an example:

from pyspark import SparkContext
sc = SparkContext("local", "RDD Operations Example")

print("Create a Python list")
data = [1, 2, 3, 4, 5]

print("Create an RDD from the Python list")
rdd = sc.parallelize(data)

print("Apply a map transformation to square each element in the RDD")
rdd2 = rdd.map(lambda x: x ** 2)

print("Apply a reduce action to sum up all the elements in the rdd2 RDD")
result = rdd2.reduce(lambda x, y: x + y)

print(f'Print final result: {result}')

Let’s understand the code:

  • Line 1: Import the SparkContext class from the pyspark module.
  • Line 2: Create a SparkContext that runs Spark in local mode ("local") with the application name “RDD Operations Example.”
  • Line 5: Create a Python list named data that contains the integers 1 through 5.
  • Line 8: Use the parallelize() method of the SparkContext to create an RDD from the Python list data. The parallelize() method distributes the data across the cluster, allowing for parallel processing. The resulting RDD is assigned to the variable rdd.
  • Line 11: Apply the map() transformation to the RDD rdd. The lambda function lambda x: x ** 2 squares each element of the RDD. The resulting RDD, rdd2, contains the squared values of the original RDD.
  • Line 14: The reduce() action is applied to the RDD rdd2. The lambda function lambda x, y: x + y is used to sum up the elements of the RDD. The reduce()