Parquet: Intro

This lesson introduces the Parquet format used for storing data in columnar format.

We'll cover the following...

Parquet
Data model

Parquet

Parquet literally means a patterned wooden surface. But, in Hadoop world, the term refers to a columnar storage format. Parquet is based on Google’s Dremel paper. It is the love-child of collaboration between Cloudera and Twitter engineers. What sets Parquet apart from other columnar formats is its ability to efficiently store nested data. Deeply nested fields are also stored in a truly columnar fashion and can be read independently of other fields. Let’s see an example. Consider the following record which represents the structure for a car record in JSON:

{
  make        : "",
  model       : "",
  year        : "",
  horsepower  : "",
  stats       : {
                   zeroToSixty: "",

...

Hadoop

YARN

Map Reduce

HDFS

Spark

Input & Output Formats

Misc

Quiz

Reference: Replication

Reference: Partitioning

Reference: Transactions

Reference: Issues in Distributed Systems

Parquet: Intro

Parquet