Encoders
Learn about Spark encoders and how they convert JVM objects to and from Spark's internal type system.
Spark, being an in-memory big data processing engine, must make efficient use of memory. Initially, Spark stored RDD data as Java objects, relying on Java serialization and deserialization. These objects lived on the Java heap, were expensive to store and manipulate, and were subject to the vagaries of Java's garbage collector. Spark 1.x introduced Tungsten, Spark's internal representation of objects in a row-based memory layout off the Java heap, and Spark 2.x brought several further improvements and optimizations to the Tungsten engine. Spark's memory use has evolved across versions, with each version being more memory-efficient than its predecessor. Before we discuss encoders, let's refresh the concepts of serialization and deserialization, often abbreviated as SerDe.
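To ground the SerDe refresher, here is a minimal sketch of the kind of Java serialization Spark originally relied on for RDD objects. The `Person` class, its fields, and the values are illustrative, not from Spark itself; the point is that the object must implement `Serializable` and that the resulting byte stream carries class metadata in addition to the field values, which is part of the storage overhead Tungsten was designed to avoid.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerDeDemo {
    // A plain JVM object; it must implement Serializable to use
    // Java's built-in serialization (hypothetical example class).
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    public static void main(String[] args) throws Exception {
        Person original = new Person("Ada", 36);

        // Serialize: turn the in-memory object into a byte stream.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(original);
        }
        byte[] serialized = buffer.toByteArray();

        // Deserialize: rebuild an equivalent object from the bytes.
        Person restored;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(serialized))) {
            restored = (Person) in.readObject();
        }

        // The byte stream is much larger than the raw field data because it
        // embeds class descriptors alongside the values.
        System.out.println(restored.name + " " + restored.age);
        System.out.println("bytes: " + serialized.length);
    }
}
```

Running this round-trips the object through a byte array, the same serialize/deserialize cycle Spark performs whenever heap objects are cached or shipped between executors.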
Encoding an ...