MapReduce in Batch Processing

Learn about the MapReduce programming model.

In this lesson, we will learn about a popular algorithm that is frequently used for batch processing of very large volumes of data. Google published this algorithm in 2004, and it was later implemented in Apache Hadoop and influenced many other data processing systems, such as Apache Spark.

The MapReduce algorithm

We’ll look at this algorithm through an example. Let’s imagine the following scenario:

  • You have all the text of a piece of classic English literature.
  • You want to count the occurrences of each word in the whole text.
  • The data is stored in some persistent storage.
  • The data is so large that it cannot be loaded into the memory of a single physical machine. This means you have to use multiple machines.
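To make the goal concrete, here is a minimal sketch of the same word count on a single machine, assuming the whole text fits in memory (which the scenario rules out). The function name `count_words` and the tokenization rule are illustrative assumptions, not part of the lesson.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count how often each word occurs in `text` (hypothetical helper)."""
    # Assumption: a "word" is a run of letters or apostrophes, case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


sample = "It was the best of times, it was the worst of times"
counts = count_words(sample)
print(counts.most_common(3))
```

This is the result the distributed approach must reproduce; it simply stops working once `text` no longer fits in one machine's memory.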

Given the above, let’s discuss an approach to count the words.

Step 1: split the data

First, the data must be split into chunks so that each machine can load its own chunks into memory. As we mentioned, the data is far too large to fit in the memory of a single machine. After the data is split, multiple machines can load the chunks and process them independently. These machines are generally called mappers: each one loads one or more chunks and runs the next step of the algorithm. ...
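The splitting step might be sketched as follows. This is a simplified, single-process illustration under stated assumptions: chunks are cut on line boundaries, and they are handed to mappers round-robin. The names `split_into_chunks` and `assign_chunks` are hypothetical, and real systems typically split by byte ranges or storage blocks instead.

```python
from typing import Dict, Iterable, Iterator, List


def split_into_chunks(lines: Iterable[str], max_lines: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `max_lines` lines each."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == max_lines:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


def assign_chunks(chunks: List[List[str]], num_mappers: int) -> Dict[int, List[List[str]]]:
    """Assign chunks to mapper IDs round-robin (an assumed policy)."""
    assignment: Dict[int, List[List[str]]] = {m: [] for m in range(num_mappers)}
    for i, chunk in enumerate(chunks):
        assignment[i % num_mappers].append(chunk)
    return assignment


lines = [f"line {i}" for i in range(7)]
chunks = list(split_into_chunks(lines, max_lines=3))
assignment = assign_chunks(chunks, num_mappers=2)
```

Cutting on line boundaries keeps every word intact within one chunk, so each mapper can count its words without coordinating with the others.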