Parquet: Definition Level
This lesson discusses the definition level used in the Parquet format.
Definition Level
We already know that columnar formats generally support efficient encoding and decoding by storing values that belong to the same column together. Parquet goes a step further by storing nested structures and fields as columns too, unlike columnar formats that store only top-level fields in a columnar fashion. To do so, Parquet needs to map each field in the schema to a flat column on disk, read it back, and then reconstruct the nested data structure. We’ll use the car example to explain the concept of definition level. The schema for a car record is as follows:
message Car {
  required string make;
  required int year;
  repeated group part {
    required string name;
    optional int life;
    repeated string oem;
  }
}
Parquet’s schema specification uses a model similar to Google’s Protocol Buffers, in a minimalistic form. Nesting is expressed using groups of fields, and repetition using repeated fields. There is no need for complex types like maps, lists, or sets, as they can be mapped to a combination of repeated fields and groups. The root of the schema is a group of fields called a message. Each field has three attributes: a repetition, a type, and a name. The type of a field is either a group or a primitive type (int, float, boolean, string), and the repetition can be one of the following three cases:
- required: exactly one occurrence
- optional: 0 or 1 occurrence
- repeated: 0 or more occurrences
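The three repetition cases can be read as a bound on how many times a field may occur inside its parent. The Python sketch below only illustrates that rule and is not part of any Parquet library. The names `Repetition` and `is_valid_count` are hypothetical and introduced here for illustration.

```python
from enum import Enum


class Repetition(Enum):
    """Hypothetical enum mirroring the three repetition cases listed above."""
    REQUIRED = "required"  # exactly one occurrence
    OPTIONAL = "optional"  # 0 or 1 occurrence
    REPEATED = "repeated"  # 0 or more occurrences


def is_valid_count(repetition: Repetition, count: int) -> bool:
    """Return True if a field with this repetition may occur `count` times."""
    if count < 0:
        return False
    if repetition is Repetition.REQUIRED:
        return count == 1
    if repetition is Repetition.OPTIONAL:
        return count <= 1
    return True  # REPEATED accepts any non-negative count


# In the car schema, `make` is required, `life` is optional, `oem` is repeated.
print(is_valid_count(Repetition.REQUIRED, 0))  # False: `make` must be present
print(is_valid_count(Repetition.OPTIONAL, 0))  # True: `life` may be absent
print(is_valid_count(Repetition.REPEATED, 3))  # True: a part may list 3 OEMs
```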
More details about Parquet’s schema specification can be found in the official Parquet GitHub repository.
One of the ideas behind Parquet is representing the schema as a tree, where the leaves of the tree always represent primitive types. In our example, the root of the tree is Car and the leaf elements are the primitive fields defined in the schema. The path from the root to any leaf element captures the nesting of structures within the schema.
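To make the tree view concrete, the following Python sketch builds the Car schema as a tree and lists the path from the root to each primitive leaf. It is a minimal illustration under assumptions, not how Parquet itself stores a schema. The `Node` class and the `leaf_paths` function are hypothetical names, and the dotted path notation is a choice made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical schema node: a message/group, or a primitive leaf."""
    name: str
    kind: str                          # "message", "group", or a primitive type
    repetition: Optional[str] = None   # the root message has no repetition
    children: List["Node"] = field(default_factory=list)


def leaf_paths(node: Node, prefix: tuple = ()) -> List[str]:
    """Return dotted paths, relative to `node`, of every primitive leaf below it."""
    paths = []
    for child in node.children:
        child_path = prefix + (child.name,)
        if child.kind == "group":
            # Descend into the nested group, carrying the path so far.
            paths.extend(leaf_paths(child, child_path))
        else:
            # A primitive field is a leaf of the schema tree.
            paths.append(".".join(child_path))
    return paths


# The Car schema from this lesson, written out as a tree.
car = Node("Car", "message", children=[
    Node("make", "string", "required"),
    Node("year", "int", "required"),
    Node("part", "group", "repeated", children=[
        Node("name", "string", "required"),
        Node("life", "int", "optional"),
        Node("oem", "string", "repeated"),
    ]),
])

print(leaf_paths(car))
# ['make', 'year', 'part.name', 'part.life', 'part.oem']
```

Each printed path corresponds to one leaf of the tree, so the nesting of `name`, `life`, and `oem` under `part` is visible in the path itself.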