
Encoders

Explore how Spark encoders work to serialize and deserialize data efficiently between Spark's internal Tungsten format and JVM objects. Understand the memory advantages of encoders over traditional Java objects, their role in Datasets, and practical tips to minimize performance costs related to serialization.


Spark, being an in-memory big data processing engine, must make efficient use of memory. Initially, Spark stored, serialized, and deserialized data as RDD-based Java objects. These objects lived on the Java heap, were expensive to store and manipulate, and were subject to the vagaries of Java’s garbage collector. Spark 1.x introduced Tungsten, Spark’s internal representation of objects in a row-based memory layout off the Java heap, and Spark 2.x brought several improvements and optimizations to the Tungsten engine. Spark’s memory use has evolved across versions, with each release being more memory-efficient than its predecessor. Before we discuss encoders, let’s refresh the concepts of serialization and deserialization, often abbreviated as SerDe.
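To make this concrete, here is a minimal sketch (assuming a local `SparkSession` and a hypothetical `Person` case class) of how a Dataset obtains an encoder, either implicitly through `spark.implicits._` or explicitly through `Encoders.product`, so that its rows are held in Tungsten's compact binary format rather than as ordinary heap objects:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case class used only for illustration.
case class Person(name: String, age: Int)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("encoder-sketch")
      .master("local[*]")   // assumes a local run for demonstration
      .getOrCreate()
    import spark.implicits._   // brings implicit encoders for case classes into scope

    // The Dataset's rows live in Tungsten's binary row format;
    // the encoder converts Person objects to and from that format.
    val people = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()

    // An encoder can also be obtained explicitly and inspected.
    val personEncoder = Encoders.product[Person]
    println(personEncoder.schema)   // schema derived from the case class fields

    people.show()
    spark.stop()
  }
}
```

With the encoder in place, operations on `people` run against the serialized Tungsten representation, and `Person` objects are only materialized on the heap when a typed operation (such as `map`) actually needs them.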

Encoding an object as a byte stream is called serializing the object. Once an object is serialized, its encoding can be transmitted from one running virtual machine to another, ...