Serialization: Working Through the Wire
Learn how serialization enables distributed computing in Spark.
The need for serialization
Serialization, the process of converting objects into streams of bytes (with de-serialization being the reverse), has traditionally been a crucial part of distributed applications, and Spark is no exception.
Once serialized, an object can be transmitted over the network to other nodes as a stream of bytes. In Spark, serialization takes place when information is shuffled between the worker nodes, and when data moves to or from the node where the driver process executes.
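For instance, any wide transformation, such as a `groupBy`, forces a shuffle: rows with the same key must end up on the same executor, so Spark serializes them before sending them across the wire. Below is a minimal sketch of this; the application name, column names, and sample data are illustrative, not taken from the lesson:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-serialization-sketch")
      .master("local[*]") // local mode, just for illustration
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(("books", 10.0), ("games", 25.0), ("books", 7.5))
      .toDF("category", "amount")

    // groupBy is a wide transformation: rows sharing a key must be
    // co-located, so Spark serializes them and ships them between
    // executors during the shuffle.
    sales.groupBy("category").sum("amount").show()

    spark.stop()
  }
}
```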
At the same time, Spark can use serialization to persist all or part of a DataFrame's data to disk, reducing network traffic and memory usage and improving performance. In these scenarios, Spark does the heavy lifting for us: it manages the serialization and transmission of rows over the network.
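As a small sketch of this "heavy lifting", the `persist` API in the Scala/Java bindings accepts a serialized storage level such as `MEMORY_ONLY_SER`, which stores each partition as a compact array of serialized bytes; continuing with the `sales` DataFrame from the previous sketch:

```scala
import org.apache.spark.storage.StorageLevel

// Cache the DataFrame as serialized bytes in memory; Spark handles
// serializing and deserializing the rows transparently.
sales.persist(StorageLevel.MEMORY_ONLY_SER)

// An action materializes the cached, serialized partitions.
sales.count()
```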
There is, however, a catch. As we've learned in previous lessons, the application's code resides on the driver node, so named because it drives, or coordinates, the distributed execution. So, the following scenario is likely to occur more than once (a sketch of it follows the list):
1- ...
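Although the steps are abridged above, the classic shape of this scenario is a function written in driver-side code that Spark must serialize and ship to the executors, together with everything it references. The sketch below assumes the well-known "Task not serializable" failure mode and uses illustrative names; it also assumes the `spark` session from the first sketch is in scope:

```scala
// A helper class defined on the driver that does NOT extend Serializable.
class Multiplier(val factor: Int) {
  def apply(x: Int): Int = x * factor
}

val multiplier = new Multiplier(2)

// The lambda below closes over `multiplier`. To run it on the executors,
// Spark must serialize the closure -- including `multiplier` -- and send
// it over the wire. Because Multiplier is not Serializable, Spark's
// closure cleaner rejects it with
// org.apache.spark.SparkException: Task not serializable.
val doubled = spark.sparkContext
  .parallelize(Seq(1, 2, 3))
  .map(x => multiplier(x)) // throws: Task not serializable

// One common fix: capture only the primitive field, which serializes fine.
// val factor = multiplier.factor
// ...map(x => x * factor)
```

The fix in the trailing comment works because the closure then captures only an `Int`, not the enclosing object; making `Multiplier` extend `Serializable` is the other common option.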