Google’s MapReduce
Learn how to design a system capable of processing gigantic amounts of data while exposing a simple interface to end programmers.
Parallel data processing—the domain of a Jedi programmer!
Parallel computing is known to be difficult, intense, and full of potential minefields in terms of data distribution, synchronization, load balancing, and handling machine failures.
Google introduced a new programming model (MapReduce) that enabled programmers (of any expertise level) to express their data processing needs as if they were writing sequential code. The MapReduce runtime automatically takes care of the messy details of distributing data and running the jobs in parallel on multiple servers, even under many fault conditions. The widespread use of the MapReduce model proves its applicability to a broad range of data processing problems.
In this lesson, we will study the MapReduce system’s design and programming model.
Design of MapReduce in a nutshell
MapReduce is a programming model, with an associated runtime implementation, for processing and generating large datasets on clusters of commodity servers.
The following illustration depicts how the process works. We will explain the design in the rest of the lesson.
Functional and non-functional requirements
The system must fulfill several functional and non-functional requirements.
Functional requirements: These include data partitioning, parallelization, dynamic load balancing, and fault tolerance. These functional requirements ensure efficient data distribution and fault-resilient processing.
Non-functional requirements: These encompass high throughput, low latency, scalability, reliability, and availability. Together, these properties make MapReduce suitable for a wide range of large-scale data processing workloads.
Programming model
MapReduce’s programming model gets its motivation from the Map and Reduce functions commonly found in functional programming languages. It automatically provides an abstraction for all of the following:
Internal data distribution mechanisms
Parallelization
Load balancing
Fault tolerance
The Map function
The Map function takes a split, in the shape of an <input_key, input_value> pair, as input and produces intermediate <output_key, intermediate_value> pairs.
map(input_key, input_value) -> list(output_key, intermediate_value)
Note: The MapReduce library consolidates all the intermediate values associated with an output key and passes them to the Reduce function.
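To make the model concrete, here is a minimal word-count Map function sketched in Python. This is an illustrative example of the programming model, not code from the MapReduce library, and the name word_count_map is our own.

def word_count_map(input_key, input_value):
    # input_key: the document name; input_value: the contents of one input split.
    intermediate = []
    for word in input_value.split():
        # Emit an intermediate <word, 1> pair for every occurrence of the word.
        intermediate.append((word, 1))
    return intermediate

For example, word_count_map("doc1", "to be or not to be") returns [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)].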
The Reduce function
The Reduce function accepts an input pair of the form <output_key, list(intermediate_value)> and returns an output pair of the form <output_key, list(output_value)>.
reduce(output_key, list(intermediate_value)) -> list(output_value)
Note: The domain of the Map input data keys and values typically differs from that of the Reduce output data keys and values. Moreover, the intermediate keys and values—the output of Map and the input to Reduce—have the same domain.
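Continuing the same illustrative word-count sketch (again, our own example, not the library's code), the corresponding Reduce function sums the grouped counts. Its input and output types also show the domain relationship described in the note above.

def word_count_reduce(output_key, intermediate_values):
    # output_key: a word; intermediate_values: the list of counts emitted for that word by Map.
    # Sum the partial counts and emit a single aggregated output value for this key.
    return [sum(intermediate_values)]

For example, after the library groups the intermediate pairs produced above, word_count_reduce("to", [1, 1]) returns [2].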
The MapReduce specification object
In addition to the Map and Reduce functions, the user needs to provide code for a MapReduce specification object that contains the names (URLs of the files in GFS) of the input and output files, along with additional tuning parameters. This specification object is passed to the MapReduce function at invocation, and the MapReduce library is responsible for linking together all the segments of the user's code.
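The following Python-style sketch shows what such a specification object might look like. The class, field names, and GFS paths are hypothetical and only mirror the description above; the real MapReduce library defines its own specification type. It reuses the word_count_map and word_count_reduce sketches from earlier.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class MapReduceSpecification:
    # Hypothetical specification object (illustrative only).
    input_files: List[str] = field(default_factory=list)  # GFS URLs of the input files
    output_file: str = ""                                  # GFS URL for the output files
    map_fn: Optional[Callable] = None                      # user-supplied Map function
    reduce_fn: Optional[Callable] = None                   # user-supplied Reduce function
    num_reduce_tasks: int = 1                              # an example tuning parameter

spec = MapReduceSpecification(
    input_files=["gfs://input/part-00000", "gfs://input/part-00001"],
    output_file="gfs://output/word-counts",
    map_fn=word_count_map,
    reduce_fn=word_count_reduce,
    num_reduce_tasks=500,
)

This spec would then be handed to the library's MapReduce entry point when the job is invoked.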
Design details
Let’s discuss the MapReduce design in detail, including estimations, the role of all the components, and the execution flow.
Estimations
In our design, we'll have to specify the number of workers required for a task based on the size of the input file.
Estimating the Map tasks (M)
We can estimate the number of Map tasks (M) required by using the following formula:
M = (Total size of the input data) / (64 MB)

(In this formula, we used 64 MB because it is the size of one chunk on the Google File System.)
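For instance (an illustrative calculation, not from the original lesson), a 1 TB input file divided into 64 MB chunks yields M = 1 TB / 64 MB = 16,384 Map tasks.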