Shared Variables in Spark
Learn how Spark makes data sharing and information gathering efficient.
Besides RDDs, Spark provides a second abstraction: distributed shared variables. Sometimes we want to send static data to all the workers (driver-to-worker information flow), and sometimes we want to collect state from all the workers (workers-to-driver information flow). Spark's shared variable abstraction supports both scenarios.
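The two information flows can be illustrated without a cluster. What follows is a minimal plain-Python sketch, not Spark code: the "workers" are ordinary function calls, and the names `lookup`, `partitions`, `run_task`, and `missing_total` are hypothetical, chosen only for illustration. In Spark itself, these two roles are played by broadcast variables (driver to workers) and accumulators (workers to driver).

```python
# Plain-Python simulation of the two information flows (no Spark required).
# Every name below is hypothetical and exists only for this illustration.

# Static, read-only data that the driver wants every worker to see
# (driver-to-worker flow).
lookup = {"a": 1, "b": 2, "c": 3}

# The data set, already split into partitions, one per simulated worker.
partitions = [["a", "b", "x"], ["c", "y"], ["a", "c", "c"]]


def run_task(partition, shared_lookup):
    """Simulate one task: translate keys using the shared table and
    count the keys that the table does not contain."""
    values = [shared_lookup[k] for k in partition if k in shared_lookup]
    missing = sum(1 for k in partition if k not in shared_lookup)
    return values, missing


results = []
missing_total = 0  # state gathered back on the driver (workers-to-driver flow)
for part in partitions:
    values, missing = run_task(part, lookup)
    results.append(values)
    missing_total += missing

print(results)        # translated values, one list per partition
print(missing_total)  # number of keys absent from the lookup table
```

Here the lookup table travels in one direction only and is never modified by a task, while the missing-key count is only ever added to and is read once all tasks have finished. That separation of read-only input from add-only output is the idea the sketch is meant to convey.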
Shared variables
Some operations require setup work for each partition, such as generating random numbers from a specific distribution. The user would otherwise have to create this setup state and send it to the worker holding the relevant partition every time a task runs on it. Shared variables help reduce this setup overhead. Shared variables can be used to add together data from all tasks or save a large ...