How to create an RDD using parallelize() in PySpark

The parallelize() method of the SparkContext is used to create a Resilient Distributed Dataset (RDD) from an iterable or a collection.

Syntax

sparkContext.parallelize(iterable, numSlices)

Parameters

iterable: This is an iterable or a collection from which an RDD has to be created.
numSlices: This is an optional parameter that specifies the number of slices (partitions) to cut the RDD into. If it is omitted, Spark uses the default parallelism, which is inferred from the cluster.
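
To make the effect of numSlices concrete, the sketch below splits a list into contiguous chunks in plain Python. It only illustrates the idea of cutting a collection into a fixed number of slices. The helper name split_into_slices is hypothetical and not part of PySpark, and the exact placement Spark chooses may differ, since PySpark also batches elements when it serializes them.

```python
def split_into_slices(items, num_slices):
    """Split items into num_slices contiguous chunks of near-equal size.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    items = list(items)
    n = len(items)
    # Chunk i covers the index range [i*n // num_slices, (i+1)*n // num_slices).
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


names = ["James", "Michael", "Robert", "Maria"]
two = split_into_slices(names, 2)
eight = split_into_slices(names, 8)

print(two)
# Number of chunks, then how many of them are empty.
print(len(eight), sum(1 for chunk in eight if not chunk))
```

When there are more slices than elements, some chunks are necessarily empty, so a large numSlices on a small collection tends to add scheduling overhead without adding parallelism.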

Return value

This method returns an RDD.

Code example

Let’s look at the code below:

# SparkSession is the entry point; the SparkContext is reached through it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('educative-answers').config("spark.some.config.option", "some-value").getOrCreate()

collection = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
  ]

sc = spark.sparkContext
# No numSlices is given, so Spark falls back to its default parallelism.
rdd = sc.parallelize(collection)
# collect() brings every element back to the driver as a Python list.
rdd_elements = rdd.collect()

print("RDD with default slices - ", rdd_elements)

print("Number of partitions - ", rdd.getNumPartitions())

print("-" * 8)
# Request 8 slices explicitly this time.
numSlices = 8

rdd = sc.parallelize(collection, numSlices)

rdd_elements = rdd.collect()

print("RDD with 8 slices - ", rdd_elements)

print("Number of partitions - ", rdd.getNumPartitions())

Code explanation

Line 4: A Spark session with the app name educative-answers is created.
Lines 6-10: The collection (or iterable) is defined.
Line 12: The SparkContext object is obtained from the Spark session.
Line 14: An RDD is constructed from the collection using the parallelize() method. Because numSlices is omitted, Spark chooses the number of slices.
Lines 16 and 28: The elements of the RDD are retrieved using the collect() method. This step is needed because the elements of an RDD are distributed across partitions; collect() gathers them into a single list on the driver.
Lines 18 and 30: The elements of the RDD are printed.
Lines 20 and 32: The number of partitions of the created RDD is retrieved using getNumPartitions().
Line 24: The number of slices is defined.
Line 26: An RDD is constructed from the collection using the parallelize() method. Here, the number of slices is set explicitly through numSlices.
