System Design: MapReduce
Let's introduce a MapReduce system for processing big data on multiple machines.
Parallel data processing—the domain of a Jedi programmer!
Parallel computing is known to be difficult, intense, and full of potential minefields in terms of
Google introduced a new programming model (MapReduce) that enabled programmers (of any expertise level) to express their data processing needs as if they were writing sequential code. The MapReduce runtime automatically takes care of the messy details of distributing data and running the jobs in parallel on multiple servers, even under many fault conditions. The widespread use of the MapReduce model proves its applicability to a broad range of data processing problems.
In this chapter, we will study the design of the MapReduce system in detail and the application of its programming model.
Motivation
Information extracted from different kinds of data sources plays an increasingly integral part in our society. Some examples include collecting, summarizing, and indexing various data from the World Wide Web for efficient searches and detecting anomalies in outbound data from some organizations.
As the volume of one data source increases, there is also a growing variety of new data sources available. Recent trends show that the data increase rate far exceeds the available computational resources needed to transform raw data into actionable information.
The graph below shows the trendline of processing power, data growth, ...