Sequence File: Intro

This lesson explains the structure of a sequence file.

Sequence File: Intro

Apart from supporting text formats, Hadoop also supports binary formats. The sequence file is one of them. Binary data takes up less disk space than textual data. The temporary data output by map tasks is stored as sequence files.

One of the problems Hadoop faces is storing lots of small files. The Namenode runs short on memory if the system has too many small files. Similarly, if the input to a map-reduce job consists of numerous small files, then the number of mapper tasks (one per file) will be significantly more than if there were fewer, larger files. To overcome these issues, sequence files were created. They serve the following two purposes:

  • Can be used as a persistent data-structure to store binary key-value pairs.
  • Can be used as a container for smaller files. In the case of storing small files as a sequence file, the names of the files become the key and the value their content.

Sequence files are well-supported within the Hadoop ecosystem but have little support outside of Java.

Composition of a record

A sequence file is internally arranged like this:

Get hands-on with 1400+ tech skills courses.