Serialization: Working through the Wire

Learn about serialization and the role it plays in Spark's distributed computing.

We'll cover the following:

The need for serialization

Serialization, which can be defined as converting objects into streams of bytes (and de-serialization as the reverse), has traditionally been a crucial part of distributed applications, and Spark is no exception.

Once an object is serialized, it can be transmitted over the network to different nodes as a stream of bytes. In Spark, serialization occurs when data is shuffled between the worker nodes or exchanged with the node where the driver process executes.

At the same time, Spark can use serialization to persist all or part of a DataFrame’s data to disk, which reduces network traffic and memory usage and improves performance. In these scenarios, Spark does the heavy lifting for us: it manages the serialization and transmission of rows over the network.
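As an example, the sketch below asks Spark to keep a DataFrame in serialized form, spilling to disk when it does not fit in memory. The application name and the input path are placeholders; any DataFrame can be persisted the same way:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class PersistExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("persist-example")
                .master("local[*]")          // assumption: running locally for illustration
                .getOrCreate();

        // Hypothetical input path; any DataFrame works the same way.
        Dataset<Row> df = spark.read().json("events.json");

        // Keep the rows in serialized form, spilling to disk when they do not
        // fit in memory. Spark handles serializing and de-serializing the rows for us.
        df.persist(StorageLevel.MEMORY_AND_DISK_SER());

        df.count();   // first action materializes and caches the data
        df.show(5);   // subsequent actions reuse the persisted rows

        spark.stop();
    }
}
```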

There is, however, a catch. As we’ve learned in previous lessons, the application’s code resides on the driver node, which is so named because it drives or coordinates the distributed execution. So, the following scenario is likely to occur more than once:

1- Transformations, in the shape of Mapper or Filter classes and others implementing Spark's function interfaces, are applied to Datasets or DataFrames of POJOs.

2- The classes implementing these Spark interfaces, which represent the functions that run distributedly (like MapFunction, FilterFunction, and others), might hold references to, or work with, collaborator, helper, or stateful objects that do not belong to the Spark library.

In these situations, Spark needs to be able to serialize these objects and send them from the driver process to the executor processes running on the worker nodes, as sketched below.
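The following sketch illustrates the scenario. The PriceFormatter helper class is hypothetical: it is created on the driver but used inside a MapFunction, so Spark must serialize it and ship it to the executors along with the function itself. If the helper did not implement Serializable, the job would fail with a "Task not serializable" error:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class ClosureSerializationExample {

    // Hypothetical helper object used by the transformation. Because it is
    // referenced from code that runs on the executors, it travels over the
    // wire with the function, so it must be Serializable.
    static class PriceFormatter implements Serializable {
        private final String currencySymbol;

        PriceFormatter(String currencySymbol) {
            this.currencySymbol = currencySymbol;
        }

        String format(double price) {
            return currencySymbol + String.format("%.2f", price);
        }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("closure-serialization")
                .master("local[*]")   // assumption: local run for illustration
                .getOrCreate();

        // The formatter is created on the driver...
        PriceFormatter formatter = new PriceFormatter("$");

        Dataset<Double> prices = spark.createDataset(
                Arrays.asList(9.99, 120.5, 3.0), Encoders.DOUBLE());

        // ...but it is captured by the MapFunction below, so Spark serializes it
        // and sends it to the executor processes together with the function.
        Dataset<String> labels = prices.map(
                (MapFunction<Double, String>) price -> formatter.format(price),
                Encoders.STRING());

        labels.show();
        spark.stop();
    }
}
```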

Before experimenting with code, let’s illustrate serialization and the scenario described above as it takes place in Spark processing:
