Basics
This lesson introduces the MapReduce paradigm to the reader.
Map and Reduce
MapReduce is a concatenation of “map” and “reduce,” which aptly describes the two phases it comprises. MapReduce is an implementation of the computing model introduced by Google, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing. In simpler terms, think of MapReduce as a divide-and-conquer strategy: a huge data set is divided among worker machines, and once processing is complete, the results from each machine are aggregated to produce the final answer. The data flow in the various phases of a MapReduce job is shown below.
MapReduce is a programming model used to process large data sets on a cluster of commodity machines by using a distributed algorithm.
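To make the two phases concrete, here is a minimal, single-machine sketch of a word-count job in plain Python. The names map_phase, shuffle, and reduce_phase are illustrative only and are not part of any framework's API; a real system such as Hadoop runs many map and reduce tasks in parallel on different machines and performs the shuffle over the network.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (key, value) pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key into a final result."""
    return key, sum(values)

if __name__ == "__main__":
    documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Map each input split, then merge all intermediate pairs.
    intermediate = [pair for doc in documents for pair in map_phase(doc)]

    # Group by key and reduce each group independently.
    counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```

Because each reduce call depends only on the values for its own key, the reduce work can be spread across machines just as easily as the map work.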
For all its strengths, MapReduce is fundamentally a batch processing system and is not suitable for interactive analysis. You can’t run a query and get results back quickly. Queries typically take minutes or more, so it’s best suited to offline use, where no one is waiting on the results in real time.