Parquet: Definition Level

This lesson discusses the definition level used in Parquet format.

Definition Level

We already know that columnar formats generally support efficient encoding and decoding by storing values that belong to the same column together. Parquet goes a step further by storing nested structures and fields as columns too, unlike other columnar formats that store only top-level fields in a columnar fashion. To do so, Parquet needs to map each field in the schema to a flat column on disk, read it back, and then reconstruct the nested data structure. We’ll use the car example to explain the concept of definition level. The schema for a car record is shown here:

message Car {
  required string make;
  required int year;

  repeated group part {
    required string name;
    optional int life;
    repeated string oem;
  }
}
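
As a rough illustration of how a nested schema maps to flat columns, the sketch below walks a hand-written mirror of the Car schema and lists one column per leaf field. The `Field` class and the `leaf_columns` helper are hypothetical names made up for this illustration; they are not part of any Parquet library. The sketch also assumes the Dremel-style rule that a column's maximum definition level equals the number of optional or repeated fields on its path, since required fields can never be missing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical stand-in for one schema field (value types omitted)."""
    name: str
    repetition: str  # "required", "optional", or "repeated"
    children: List["Field"] = field(default_factory=list)  # empty for leaves


# Hand-written mirror of the Car schema; only the structure matters here.
CAR = [
    Field("make", "required"),
    Field("year", "required"),
    Field("part", "repeated", [
        Field("name", "required"),
        Field("life", "optional"),
        Field("oem", "repeated"),
    ]),
]


def leaf_columns(fields, prefix=(), level=0):
    """Return (dotted column path, max definition level) for each leaf field."""
    columns = []
    for f in fields:
        path = prefix + (f.name,)
        # Assumption: only fields that may be absent (optional or repeated)
        # add a definition level; required fields add none.
        depth = level + (0 if f.repetition == "required" else 1)
        if f.children:
            columns.extend(leaf_columns(f.children, path, depth))
        else:
            columns.append((".".join(path), depth))
    return columns


for column, max_def in leaf_columns(CAR):
    print(column, max_def)
```

Under these assumptions, the five leaf fields become five flat columns: `make` and `year` at level 0, `part.name` at level 1, and `part.life` and `part.oem` at level 2.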

Parquet describes its schema with a minimalistic model similar to Google’s Protocol Buffers. Nesting is expressed using groups of fields, and repetition using repeated fields. There is no need for complex types like Maps, Lists, or Sets, as they can be mapped to a ...