...

Encoders

Learn about Spark encoders and how they help convert types to and from Spark's internal type system to JVM.

We'll cover the following...

Encoders
Cost of using Datasets

Spark, being an in-memory big data processing engine, must make efficient use of memory. Initially, Spark used RDD based Java objects for memory storage, serialization, and deserialization. The objects lived on the Java heap and were expensive to store and manipulate. In addition, the objects are affected by the vagaries of Java’s garbage collector. In Spark 1.x, Tungsten was introduced, which is Spark’s internal representation of objects in a row-based memory layout off of the Java heap. Spark 2.x came with several improvements and optimizations to the Tungsten engine. The memory use of Spark has evolved across its different versions, with each version being more memory-efficient than its predecessor. Before we discuss encoders, let’s refresh the concepts of serialization and deserialization, often abbreviated as SerDe.

Encoding an object as a byte stream is called serializing the object. Once an object is serialized, its encoding can be transmitted from one running virtual machine to another, stored on disk or a ...

Spark Overview

DataFrames

Datasets

Spark SQL

Summary

Encoders