MapReduce is a framework developed by Google to process large amounts of data in a timely, efficient manner. One of the best-known implementations of the model is Apache Hadoop MapReduce.
MapReduce takes advantage of numerous servers across which data can be distributed and managed. Like every good framework, MapReduce provides abstractions over the processes that run during the execution of user commands, such as fault tolerance, data partitioning, and data aggregation. These abstractions let the user focus on the high-level logic of the program while trusting the framework to handle the work under the hood.
MapReduce follows this workflow:
There are several Map Workers and Reduce Workers, but only one Master Node, which assigns tasks to the Map and Reduce Workers and coordinates their work.
The input data usually arrives as one large chunk. The first step is to partition it into smaller, more manageable pieces that the Map Workers can process efficiently.
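As a concrete illustration, here is a minimal sketch of this splitting step in Python. The 64 MB split size, the `partition_input` name, and the naming scheme for splits are all assumptions for illustration; real frameworks also take care to split on record boundaries rather than at arbitrary byte offsets.

```python
SPLIT_SIZE = 64 * 1024 * 1024  # bytes per split (illustrative default, not a fixed rule)

def partition_input(path: str, split_size: int = SPLIT_SIZE):
    """Yield (filename, content) pairs, one per split, for the Map Workers."""
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(split_size)
            if not chunk:
                break
            # Name each split after the source file and its position.
            yield f"{path}-split-{index}", chunk.decode("utf-8", errors="ignore")
            index += 1
```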
Map Workers receive the data as <key, value> pairs, where the key is a filename and the value is that file's content. The Map Workers process this data according to the user-defined Map function and emit intermediate <key, value> pairs.
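For example, a user-defined Map function for the classic word-count problem might look like the sketch below; the function name and the word-count task itself are illustrative choices, not part of the framework.

```python
def map_function(key: str, value: str):
    """key: a filename, value: that file's content."""
    for word in value.split():
        # Emit an intermediate (word, 1) pair for every word seen.
        yield word, 1

# list(map_function("doc1.txt", "to be or not to be"))
# -> [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```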
The intermediate pairs are partitioned into R partitions, where R is the number of Reduce Workers. These partitions are buffered in memory until the Master Node forwards them to the Reduce Workers.
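MapReduce's default partitioning scheme is hash partitioning: each intermediate key is assigned to one of the R Reduce Workers by hashing the key modulo R, as in the sketch below. A stable hash (here `zlib.crc32`) stands in for Python's built-in `hash`, since the latter is salted per process and would assign keys inconsistently across workers.

```python
import zlib

def partition(key: str, R: int) -> int:
    """Map an intermediate key to one of the R Reduce Workers."""
    # zlib.crc32 gives a stable hash across processes and machines,
    # unlike Python's built-in hash(), which is salted per process.
    return zlib.crc32(key.encode("utf-8")) % R
```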
As soon as the Reduce Workers receive the buffered data, they sort it by key and group together all pairs that share the same key.
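A minimal sketch of this sort-and-group step, assuming the intermediate pairs fit in memory (real frameworks fall back to an external sort for larger data):

```python
from itertools import groupby
from operator import itemgetter

def sort_and_group(pairs):
    """pairs: iterable of (key, value) -> yields (key, [values])."""
    # Sort by key, then collect the values of each run of equal keys.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

# list(sort_and_group([("be", 1), ("to", 1), ("be", 1)]))
# -> [('be', [1, 1]), ('to', [1])]
```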
Each Reduce Worker notifies the Master Node when it finishes its task. Finally, the sorted data is aggregated by the user-defined Reduce function, and R output files are generated for the user.
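Continuing the word-count example, a user-defined Reduce function and the final output step might look like the sketch below; the `write_output` helper and its file-naming scheme are hypothetical choices for illustration.

```python
def reduce_function(key: str, values: list) -> tuple:
    """key: a word, values: its intermediate counts -> (word, total)."""
    return key, sum(values)

def write_output(r: int, grouped_pairs) -> None:
    """Write Reduce Worker r's results to its own output file."""
    # One output file per Reduce Worker, so R files in total.
    with open(f"part-{r:05d}.txt", "w") as out:
        for key, values in grouped_pairs:
            word, total = reduce_function(key, values)
            out.write(f"{word}\t{total}\n")
```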