Serialization: Working Through the Wire
Learn how serialization enables distributed computing in Spark.
The need for serialization
Serialization, the process of converting objects into streams of bytes (with de-serialization being the reverse), has traditionally been a crucial part of distributed applications, and Spark is no exception.
Once serialized, an object can be transmitted over the network to other nodes as a stream of bytes. In Spark, serialization takes place when information is shuffled between the worker nodes, and when data moves to or from the node where the driver process executes.
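For instance, any wide transformation, such as a `groupBy`, forces a shuffle: rows with the same key must end up on the same executor, so Spark serializes them before sending them across the wire. Below is a minimal sketch of this; the application name, column names, and sample data are illustrative, not taken from the lesson:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-serialization-sketch")
      .master("local[*]") // local mode, just for illustration
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(("books", 10.0), ("games", 25.0), ("books", 7.5))
      .toDF("category", "amount")

    // groupBy is a wide transformation: rows sharing a key must be
    // co-located, so Spark serializes them and ships them between
    // executors during the shuffle.
    sales.groupBy("category").sum("amount").show()

    spark.stop()
  }
}
```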
At the same time, Spark can use serialization to persist all or part of a DataFrame's data to disk, reducing network traffic and memory usage and improving performance. In these scenarios, Spark does the heavy lifting for us: it manages the serialization and transmission of rows over the network.
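As a small sketch of this "heavy lifting", the `persist` API in the Scala/Java bindings accepts a serialized storage level such as `MEMORY_ONLY_SER`, which stores each partition as a compact array of serialized bytes; continuing with the `sales` DataFrame from the previous sketch:

```scala
import org.apache.spark.storage.StorageLevel

// Cache the DataFrame as serialized bytes in memory; Spark handles
// serializing and deserializing the rows transparently.
sales.persist(StorageLevel.MEMORY_ONLY_SER)

// An action materializes the cached, serialized partitions.
sales.count()
```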
There is, however, a catch. As we've learned in previous lessons, the application's code resides on the driver node, so named because it drives, or coordinates, the distributed execution. So, the following scenario is likely to occur more than once (a sketch of it follows the list):
1- ...
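Although the steps are abridged above, the classic shape of this scenario is a function written in driver-side code that Spark must serialize and ship to the executors, together with everything it references. The sketch below assumes the well-known "Task not serializable" failure mode and uses illustrative names; it also assumes the `spark` session from the first sketch is in scope:

```scala
// A helper class defined on the driver that does NOT extend Serializable.
class Multiplier(val factor: Int) {
  def apply(x: Int): Int = x * factor
}

val multiplier = new Multiplier(2)

// The lambda below closes over `multiplier`. To run it on the executors,
// Spark must serialize the closure -- including `multiplier` -- and send
// it over the wire. Because Multiplier is not Serializable, Spark's
// closure cleaner rejects it with
// org.apache.spark.SparkException: Task not serializable.
val doubled = spark.sparkContext
  .parallelize(Seq(1, 2, 3))
  .map(x => multiplier(x)) // throws: Task not serializable

// One common fix: capture only the primitive field, which serializes fine.
// val factor = multiplier.factor
// ...map(x => x * factor)
```

The fix in the trailing comment works because the closure then captures only an `Int`, not the enclosing object; making `Multiplier` extend `Serializable` is the other common option.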