Broadcast Variables and PySpark Accumulators
Learn how to efficiently share read-only data and aggregate results in a distributed manner.
Broadcast variables and accumulators are two PySpark features for sharing state in a cluster: broadcast variables efficiently distribute read-only data to all nodes, while accumulators aggregate results from distributed tasks. Let's understand each of these two concepts in this lesson.
Broadcast variables
Broadcast variables in PySpark are read-only shared variables that are cached and made available on all nodes in a cluster so that tasks can access them locally. They are used to share large, read-only data across distributed tasks at a low communication cost: instead of sending the data along with every task, PySpark broadcasts the variable to the worker nodes once, and tasks read the cached copy locally.
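For instance, here is a minimal sketch (the lookup table is made-up sample data) where a small dictionary is broadcast once and every task reads the locally cached copy through its value attribute:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical small lookup table; in practice this could be much larger.
state_names = {"NY": "New York", "CA": "California", "TX": "Texas"}

# Ship one copy to each worker node instead of one copy with every task.
bc_states = sc.broadcast(state_names)

orders = sc.parallelize([("NY", 1), ("CA", 2), ("TX", 3)])

# Each task reads the locally cached copy through .value.
print(orders.map(lambda kv: (bc_states.value[kv[0]], kv[1])).collect())
# [('New York', 1), ('California', 2), ('Texas', 3)]
```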
How do PySpark broadcast variables work?
- PySpark breaks the job into stages separated by distributed shuffles, and operations are executed within each stage.
- Each stage is further broken down into tasks.
- PySpark broadcasts the common data required by tasks within each stage.
- The broadcast data is cached on each worker node in serialized form, and it's deserialized before each task that uses it runs (see the sketch after this list).
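To make the staging behavior concrete, here is a minimal sketch (the order data and tax rates are made up for illustration). The reduceByKey step forces a shuffle, so the job splits into two stages, and tasks in both stages reuse the same cached broadcast copy:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-stages").getOrCreate()
sc = spark.sparkContext

# Hypothetical tax-rate table shared by tasks in every stage that needs it.
rates = sc.broadcast({"book": 0.0, "food": 0.05, "tech": 0.2})

orders = sc.parallelize([("book", 10.0), ("tech", 100.0), ("tech", 50.0)])

# Stage 1: map-side tasks read the locally cached broadcast copy.
taxed = orders.map(lambda kv: (kv[0], kv[1] * (1 + rates.value[kv[0]])))

# reduceByKey triggers a shuffle, so the following operations run in a new stage...
totals = taxed.reduceByKey(lambda a, b: a + b)

# Stage 2: ...whose tasks deserialize and reuse the same cached broadcast data.
labeled = totals.map(lambda kv: (kv[0], round(kv[1], 2), rates.value[kv[0]]))
print(labeled.collect())
```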
Creating a PySpark broadcast variable
To create a PySpark broadcast variable, we can use the SparkContext object's broadcast() method.
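For example, here is a minimal sketch (the dictionary contents are arbitrary) showing the lifecycle of a broadcast variable:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-create").getOrCreate()
sc = spark.sparkContext

# Create the broadcast variable from any picklable Python object.
config = sc.broadcast({"threshold": 10, "mode": "strict"})

# Read it (on the driver or inside tasks) through the .value attribute.
print(config.value["threshold"])  # 10

# Release the cached copies on the executors when no longer needed.
config.unpersist()

# Or remove it permanently; it cannot be used again afterwards.
config.destroy()
```

Note the difference between the two cleanup calls: unpersist() only releases the cached copies on the executors, and the data is re-broadcast if the variable is used again, whereas destroy() removes the variable entirely.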