Mapper Input

This lesson explains concepts relating to input data for map tasks.


Input splits

Our example demonstrates a simplistic scenario where the input for the MR job is entirely contained in a single file. In reality, the input to an MR job usually consists of several gigabytes of data, which is split among multiple map tasks.

Each map task works on a unit of data called the input split.

Hadoop divides the MR job's input into equal-sized chunks, and each map task works on one such chunk: the input split. A user can tweak the size of the input split. As a corollary, the number of map tasks spawned for an MR job equals the number of input splits.
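As a rough sketch of how this tweaking is done (the job name, input path, and sizes below are illustrative placeholders, not values from this lesson), a driver that uses the standard FileInputFormat can hint at the desired split size like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo"); // hypothetical job name

            // Placeholder input path; point this at the job's real input directory.
            FileInputFormat.addInputPath(job, new Path("/data/input"));

            // Ask for splits between 64 MB and 128 MB. FileInputFormat also
            // factors in the HDFS block size, so these act as bounds rather
            // than exact values.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            // ... set the mapper, reducer, and output settings as usual, then submit ...
        }
    }

With the default FileInputFormat, the effective split size works out to roughly max(minSize, min(maxSize, blockSize)), so unless these bounds are changed, each split typically corresponds to one HDFS block.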

However, we must strike a fine balance when working with input splits. A greater number of input splits means more map tasks for an MR job, and the job finishes faster because the map tasks work in parallel on individual splits. But too many input splits come with a corresponding increase in the overhead for ...