Parallel data processing—the domain of a Jedi programmer!

Parallel computing is known to be difficult, intense, and full of potential minefields in terms of HeisenbugsA software bug that disappears or changes its behavior when we attempt to investigate or resolve it.. Combined with the rise of many-core servers and distributed computing, parallel computing is useful and can’t be ignored by regular programmers to speed up their applications or left to Jedi programmers.

Google introduced a new programming model (MapReduce) that enabled programmers (of any expertise level) to express their data processing needs as if they were writing sequential code. The MapReduce runtime automatically takes care of the messy details of distributing data and running the jobs in parallel on multiple servers, even under many fault conditions. The widespread use of the MapReduce model proves its applicability to a broad range of data processing problems.

In this lesson, we will study the MapReduce system’s design and programming model.

Design of MapReduce in a nutshell

MapReduce is a restricted programming modelA programming model where the programmer is expected to adhere to a specific way of programming but, in return, gets some benefits. to process large datasets (big data) effectively and efficiently, structured or unstructured alike, on a cluster of machines. One of the model’s restrictions is that the input and output of the processing code should be in key-value pairs. It takes input data from key-value pairs and aggregates user-defined processed output as different key-value pairs. The input is a large and complex dataset, and the computations get distributed across numerous machines to finish the processing task in a reasonable amount of time.

The following illustration depicts how the process works. We will explain the design in the rest of the lesson.

Programming model

MapReduce’s programming model gets its motivation from LISP’sA programming language developed in 1958. map and reduce operations and requires two user-defined functions, Map and Reduce. It automatically provides an abstraction for all the:

Internal data distribution mechanisms
Parallelization
Load balancing
Fault tolerance

The `Map` function

The Map function takes a split, in the shape of an <input_key, input_value> pair, as input and produces intermediate <output_key, intermediate_value> pairs.

Note: It is important to note that the domain of the Map input data keys and values typically differ from that of the Reduce output data keys and values. Moreover, the intermediate keys and values—the output of Map and input to Reduce—have the same domain.

The `MapReduce` specification object

In addition to the Map and Reduce functions, the user needs to provide the code for a MapReduce specification object with the names (URLs to the files in GFS) of the input, output files, and the additional tuning parameters. The MapReduce library passes this specification object to the MapReduce function at invocation and is responsible for linking all the segments of the user code.

Design details

Let’s discuss the MapReduce design in detail, including estimations, the role of all the components, and the execution flow.

Estimations

In our design, we’ll have to specify the number of workers required to achieve the task based on the input file size.

Estimating the `Map` tasks (M)

We can estimate the number of Map tasks (M) required by using the following formula:

(In this formula, we used 64 MB because it is the size of one chunk on the Google File System.)

$N u m b e r o f m a p t a s k s$ ...

Google’s MapReduce

Parallel data processing—the domain of a Jedi programmer!

Design of MapReduce in a nutshell

Functional and non-functional requirements

Programming model

The `Map` function

The `Reduce` function

The `MapReduce` specification object

Design details

Estimations

Estimating the `Map` tasks (M)

Google’s MapReduce

Parallel data processing—the domain of a Jedi programmer!

Design of MapReduce in a nutshell

Functional and non-functional requirements

Programming model

The Map function

The Reduce function

The MapReduce specification object

Design details

Estimations

Estimating the Map tasks (M)

The `Map` function

The `Reduce` function

The `MapReduce` specification object

Estimating the `Map` tasks (M)