MapReduce's Manager-Worker Architecture
Let's study the architecture of MapReduce and the guarantees provided by it.
We'll cover the following...
The manager node is responsible for scheduling tasks for worker nodes and managing their execution, as shown in the following illustration:
Apart from the definition of the map
and reduce
functions, the user can also specify the number M
of map
tasks, the number R
of reduce
tasks. MapReduce can also specify the number of input or output files, and a partitioning function that defines how key-value pairs from the map
tasks are partitioned before being processed by the reduce
tasks. By default, a hash partitioner is used that selects a reduce
task using the formula hash(key) mod R
.
Steps for the execution of MapReduce
The execution of MapReduce proceeds in the following way:
-
The framework divides the input files into
M
pieces, called input splits, typically between 16 and 64 MB per split. -
It then starts an instance of a manager node and multiple worker node instances on an existing ...