MapReduce: Evaluation
Let's see how well the MapReduce design meets its requirements.
Evaluation
The use cases of our system span a variety of problems, ranging from processing raw data to gaining insights from derived data, but we can still evaluate how well our design fulfills the requirements for general use.
The MapReduce library inherently handles the mechanisms of automatic parallelization, data splitting, load balancing, and fault tolerance. Let's see how our design fulfills the additional non-functional requirements.
Fault tolerance
Since we deal with large datasets, fault tolerance is critical. Faults can happen at any stage and in any component. Let's look at each case in detail.
- Worker failure: Let's consider the scenario of a failed worker. The manager marks any worker that does not respond to its periodic calls as failed. Once the manager declares a worker failed, it reschedules all of that worker's completed and in-progress tasks to other available workers. (If the actual disk or server fails and is unreachable, and the Reduce tasks haven't yet fetched the completed map outputs, the manager also needs to reschedule those completed map tasks. If only the worker process fails, the manager only needs to reschedule the in-progress work.) When the reassigned work completes, the manager notifies all the reducers working on the failed mapper's data and gives them the new mapper's address to fetch it from. The manager also deletes the failed mapper's processed data from local disk to free up space. In case of a reducer failure, the manager reassigns its failed tasks to another reducer and provides the new reducer with all the information required for data fetching. A simplified sketch of this detection-and-rescheduling logic appears after this list.
- Manager failure: Remember that when a MapReduce job starts, one worker takes on the role of the manager. Each user-spawned job has its own manager, so the failure of a manager has only a limited impact (on that specific user's job). We should note, however, that many MapReduce jobs are long-running (for example, processing a crawl of the WWW), and a manager failure can hurt them badly. The MapReduce library does not deal with the manager's failure and leaves it to the end users. Our current implementation stops the MapReduce job once the manager fails. However, we can implement a fail-safe option by making the manager save snapshots of its state periodically and revert to the latest snapshot in case of a failure, thereby reducing the impact of such failures.
- Bad records: As discussed earlier, our design handles bad records by skipping them during re-execution.
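The snippet below is a minimal, illustrative sketch of how a manager might detect failed workers through missed heartbeats and reschedule their tasks, as described in the worker-failure case above. It is not the actual library code; the names (`ManagerState`, `HEARTBEAT_TIMEOUT`, `reschedule`) and the bookkeeping structures are assumptions made for illustration.

```python
import time

HEARTBEAT_TIMEOUT = 10  # assumed: seconds without a response before a worker is declared failed


class ManagerState:
    """Minimal bookkeeping a manager needs to detect and recover from worker failures."""

    def __init__(self):
        self.last_heartbeat = {}  # worker_id -> timestamp of the last successful ping
        self.tasks = {}           # task_id -> {"worker": ..., "phase": "map"|"reduce",
                                  #             "state": "in_progress"|"completed"}

    def record_heartbeat(self, worker_id):
        self.last_heartbeat[worker_id] = time.time()

    def detect_failed_workers(self):
        now = time.time()
        return [w for w, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

    def reschedule(self, failed_worker, idle_workers):
        """Reassign tasks owned by a failed worker.

        Completed map tasks are re-run as well, because their output lived on the
        failed worker's local disk; completed reduce output already sits in the
        global file system and is kept.
        """
        for task in self.tasks.values():
            if task["worker"] != failed_worker:
                continue
            must_rerun = (task["state"] == "in_progress" or
                          (task["phase"] == "map" and task["state"] == "completed"))
            if must_rerun and idle_workers:
                task["worker"] = idle_workers.pop()
                task["state"] = "in_progress"
                # Reducers reading from the failed mapper are notified of the new
                # worker's address once the re-run map task finishes (not shown here).
```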
By handling all these situations, our system ensures fault tolerance. Let’s go through the semantics in case of failures.
Semantics in case of failures
We can further divide these semantics based on the deterministic or non-deterministic nature of the Map and Reduce functions.
Deterministic functions
When the user-defined Map and Reduce functions are deterministic, the output of our distributed implementation is the same as that of a non-faulting sequential execution of the entire program. This provides a strong baseline for users to predict and understand the behavior of their program.
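To make the distinction concrete, here is a small, hypothetical example: a deterministic word-count Map and Reduce pair, followed by a non-deterministic Map function whose output would differ across re-executions after a failure. The function signatures are illustrative, not the library's actual API.

```python
# Deterministic: output depends only on the input key/value pair.
def word_count_map(doc_name, contents):
    for word in contents.split():
        yield (word, 1)

def word_count_reduce(word, counts):
    yield (word, sum(counts))

# Non-deterministic: the result changes from run to run, so re-executing this task
# after a failure could produce output that no sequential execution would produce.
import random
import time

def sampled_map(doc_name, contents):
    for word in contents.split():
        if random.random() < 0.5:       # depends on this run's random choices
            yield (word, time.time())   # depends on when the task happens to run
```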
The program achieves this property by relying on the atomic nature of the commits made by the Map and Reduce tasks. Both kinds of tasks write their output to private temporary files (on local disk and in GFS, respectively) and rename them only once the output is fully generated.
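Before we look at each task type in detail, the sketch below illustrates this write-to-a-temporary-file-then-rename commit pattern on a local filesystem; GFS provides a similar atomic rename for reduce outputs. The helper name and the tab-separated record format are assumptions made for illustration.

```python
import os
import tempfile

def commit_task_output(records, final_path):
    """Write task output to a private temporary file, then atomically rename it.

    The rename happens only after the whole output is generated, so readers either
    see no file at all or a complete one; partial output from a failed or duplicate
    task execution is never observed.
    """
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            for key, value in records:
                tmp.write(f"{key}\t{value}\n")
        os.rename(tmp_path, final_path)   # atomic on POSIX filesystems
    except Exception:
        os.remove(tmp_path)               # discard partial output on failure
        raise
```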
- A Map task writes its output to