Encoders
Learn about Spark encoders and how they help convert types to and from Spark's internal type system to JVM.
We'll cover the following
Spark, being an in-memory big data processing engine, must make efficient use of memory. Initially, Spark used RDD based Java objects for memory storage, serialization, and deserialization. The objects lived on the Java heap and were expensive to store and manipulate. In addition, the objects are affected by the vagaries of Java’s garbage collector. In Spark 1.x, Tungsten was introduced, which is Spark’s internal representation of objects in a row-based memory layout off of the Java heap. Spark 2.x came with several improvements and optimizations to the Tungsten engine. The memory use of Spark has evolved across its different versions, with each version being more memory-efficient than its predecessor. Before we discuss encoders, let’s refresh the concepts of serialization and deserialization, often abbreviated as SerDe.
Encoding an object as a byte stream is called serializing the object. Once an object is serialized, its encoding can be transmitted from one running virtual machine to another, stored on disk or a database, sent over the wire to another computer, and more. The receiving entity can reconstruct the object from the byte stream, a process known as deserialization, which is the reverse of serialization.
Encoders
Encoders are the magic behind the conversion of data in off-heap memory from Spark’s internal Tungsten format to JVM Java objects. You may wonder why there is a need to create Dataset and DataFrame objects using Spark’s internal format and not just as JVM objects. The reasons are:
- JVM objects come with a lot of overhead and include data like header info, hashcode, and Unicode info, all of which is unnecessary for working with DataFrames and Datasets. Spark’s internal Tungsten binary format stores objects off the Java heap memory in a way requiring minimal space. For example, consider storing an object of class consisting of an integer and two string fields. If the object values for an instance of the class are
(791, "data", "expert")
they get stored as follows:
Get hands-on with 1400+ tech skills courses.