RDD Operations
Learn the basics of RDD operations.
Introduction to RDD operations
There are two types of RDD operations:
- Transformations: These are RDD operations that create a new dataset from an existing one.
- Actions: These are RDD operations that return a value to the driver program after running a computation on the dataset.
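To make the distinction concrete without needing a Spark installation, here is a minimal pure-Python sketch. The `ToyRDD` class is hypothetical and is not part of PySpark; it only borrows the method names `map`, `filter`, `reduce`, and `collect` to show that transformations hand back another dataset while actions hand back an ordinary value.

```python
from functools import reduce as fold


class ToyRDD:
    """Hypothetical in-memory stand-in for an RDD (not part of PySpark)."""

    def __init__(self, items):
        self._items = list(items)

    # Transformations: each returns a new ToyRDD and leaves the original unchanged.
    def map(self, fn):
        return ToyRDD(fn(x) for x in self._items)

    def filter(self, predicate):
        return ToyRDD(x for x in self._items if predicate(x))

    # Actions: each returns a plain Python value to the caller.
    def reduce(self, fn):
        return fold(fn, self._items)

    def collect(self):
        return list(self._items)


numbers = ToyRDD([1, 2, 3, 4, 5])
squares = numbers.map(lambda x: x ** 2)               # transformation -> ToyRDD
even_squares = squares.filter(lambda x: x % 2 == 0)   # transformation -> ToyRDD
total = squares.reduce(lambda x, y: x + y)            # action -> int

print(type(squares).__name__)   # ToyRDD
print(even_squares.collect())   # [4, 16]
print(total)                    # 55
```

This toy version runs everything eagerly on a single machine. A real RDD is distributed, and Spark defers transformations until an action is called, which the sketch does not attempt to model.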
Let’s understand RDD operations through an example:
from pyspark import SparkContext
sc = SparkContext("local", "RDD Operations Example")

print("Create a Python list")
data = [1, 2, 3, 4, 5]

print("Create an RDD from the Python list")
rdd = sc.parallelize(data)

print("Apply a map transformation to square each element in the RDD")
rdd2 = rdd.map(lambda x: x ** 2)

print("Apply a reduce action to sum up all the elements in the rdd2 RDD")
result = rdd2.reduce(lambda x, y: x + y)
print(f'Print final result: {result}')
Let’s understand the code:
- Line 1: Import the SparkContext class from the pyspark module.
- Line 2: Create a SparkContext that runs in local mode, with the application name “RDD Operations Example.”
- Line 5: Create a Python list named data with some elements.
- Line 8: Use the parallelize() method of the SparkContext to create an RDD from the Python list data. The parallelize() method distributes the data across the cluster, allowing for parallel processing. The resulting RDD is assigned to the variable rdd.
- Line 11: The map() transformation is applied to the RDD rdd. The lambda function lambda x: x ** 2 squares each element of the RDD. The resulting RDD, rdd2, contains the squared values of the original RDD.
- Line 14: The reduce() action is applied to the RDD rdd2. Unlike map(), reduce() is an action, not a transformation, because it returns a value to the driver rather than a new RDD. The lambda function lambda x, y: x + y sums the elements of the RDD. The reduce()