MapReduce: Detailed Design

Let’s discuss the MapReduce design in detail, including estimations and the detailed role of all the components—their internal mechanisms and the execution flow.

Estimations

In our design, we’ll have to specify the number of workers required to achieve the task based on the input file size. Let’s make some assumptions before we formulate the workers’ calculations.

Assumptions

  • The input file size is 12.8 TB12.8\ TB (we assume this data size to simplify our calculations, as we will see shortly.)
  • We will distribute the input data into splits of roughly 16 MB16\ MB to 64 MB64\ MB (we use multiples of 16 MB because the underlying storage of GFS’s chunk size is 64 MB, and we want to maximize data locality benefits).
  • We can distinguish the tasks into two categories based on their executing functions— Map and Reduce.
  • All workers are identical in their memory and computation resources.
  • There are no malfunctioning workers.

Note: In this chapter, we’ll refer to workers as machines. Even though our design automatically handles the differences in workers’ memories and computation resources and the cases of faultiness, these calculations remain independent of these points.

Estimating the Map tasks (M)

We can estimate the number of Map tasks required using the following formula:

Number of map tasks (M)=File size in MBs64 MBNumber\ of\ map\ tasks\ (M) = \frac{File\ size\ in\ MBs}{64\ MB}

Using the file size under the assumptions section, we can estimate the number of Map tasks: M=12800000 MB64 MB=200000M = \frac{12800000\ MB}{64\ MB} = 200000

Workers estimation

Based on the number of Map tasks (M), we can estimate the number of workers (W) as well, making each worker perform 100100 Map tasks.

Number of workers (W)=Number of map tasks100Number\ of\ workers\ (W) = \frac{Number\ of\ map\ tasks}{100}

Using the number from the previous calculation for Map tasks, this estimation is as follows: W=200000100=2000W = \frac{200000}{100} = 2000

Estimating the Reduce tasks (R)

Mainly, the number of Reduce tasks (R) are user-defined as each task produces its separate output file. We usually estimate this number as a small multiple of the available workers in the cluster, 2.52.5 as an example.

Number of reduce tasks (R)=Number of workers available2.5Number\ of\ reduce\ tasks\ (R) = Number\ of\ workers\ available * 2.5

Based on the number of workers above, we can estimate R as follows:

R=20002.5=5000R = 2000 * 2.5 = 5000

You may change the input file size in the following calculator widget to see how it impacts M, worker count, and the R values.

Level up your interview prep. Join Educative to access 80+ hands-on prep courses.