Mapper Input

This lesson explains concepts relating to input data for map tasks.

Mapper Inputs

Input splits

Our example demonstrates a simplistic scenario where the input is entirely contained in a single file for the MR job. In reality, the input to a MR job usually consists of several GBs of data. That data is split among multiple map tasks.

Each map task works on a unit of data called the input split.

Hadoop divides the MR job input into equal sized chunks. Each map task works on one chunk - the input split. A user can tweak the size of the input split. As a corollary, the number of map tasks spawned for a MR job is equal to the number of input splits.

However, we walk a fine balance when working with input splits. A greater number of input splits means more map tasks for a MR job. In turn, the MR job processes faster because the map tasks work in parallel on individual splits. However, too many input splits come with a corresponding increase in the overhead for managing splits and creating map tasks. After a point, the management overhead becomes the dominating factor in job execution. Generally, the HDFS block size (default 128 MB) is considered a good split size. The maximum and minimum split sizes can be adjusted through configuration.

Remember that the input split is a logical concept and represented by a Java abstract class InputFormat. The class definition is taken from the code base:

public abstract class InputSplit {
    public InputSplit() {
    }

    public abstract long getLength() throws IOException, InterruptedException;

    public abstract String[] getLocations() throws IOException, InterruptedException;

    @Evolving
    public SplitLocationInfo[] getLocationInfo() throws IOException {
        return null;
    }
}

Input split doesn’t contain the data, but rather a reference to the data. The MapReduce framework tries to schedule map tasks as close as possible to the split’s data.

Consider our car counting example. Each line in the data file represents a record. An input split, comprised of HDFS blocks, are made up of records. It is possible that a single record may span multiple HDFS blocks. There’s no guarantee that a single HDFS block will consist of a whole number of records. HDFS has no notion of inside contents of a HDFS block. It cannot determine if a record spills over into another block. This problem is solved by input splits. When a MapReduce job client calculates the input splits, it figures out where the first whole record in a block begins and where the last record in the block ends. In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record. The consequence: the map task processing such an input split may read a remote data block stored on a different node in the cluster in order to completely process the input split.

Get hands-on with 1300+ tech skills courses.