Resiliency

This lesson explains how failures of various components are handled in the MapReduce framework.

So what is so special about MapReduce? After all, the same problems can be solved on other platforms or supercomputers. The lure of MapReduce is its ability to run on cheap commodity hardware, where there’s a high probability of hardware and other infrastructure breaking down. In this lesson, we’ll see how a running MapReduce job handles various failures.

Task failure

A map or a reduce task for a MapReduce job can fail for many reasons: a bug in the user’s code, a bug in the JVM, insufficient space on the drive, and others. When a task fails, the JVM hosting the task reports the error to its parent ApplicationMaster before exiting. The ApplicationMaster marks the task attempt as failed and reschedules the task for execution, on a different node manager if possible. The number of times a map or reduce task may fail is configurable; once a task exceeds that limit, the entire job is marked as failed.
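
The retry behavior described above can be sketched in a few lines of plain Java. This is only an illustrative simulation, not Hadoop code: the class TaskRetrySketch, the method runTask, and the round-robin node choice are all assumptions made for the example. In a real cluster, the attempt limit comes from the job configuration properties mapreduce.map.maxattempts and mapreduce.reduce.maxattempts, and the ApplicationMaster requests containers from YARN rather than cycling through a fixed list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical simulation of how a failed task attempt is retried.
// Nothing here is part of the Hadoop API.
public class TaskRetrySketch {

    /** Outcome of running one task under the retry policy. */
    static final class Result {
        final boolean succeeded;
        final int attemptsUsed;
        final List<String> nodesTried;

        Result(boolean succeeded, int attemptsUsed, List<String> nodesTried) {
            this.succeeded = succeeded;
            this.attemptsUsed = attemptsUsed;
            this.nodesTried = nodesTried;
        }
    }

    /**
     * Runs a task up to maxAttempts times.
     *
     * @param attemptSucceeds stand-in for the task itself: given the 1-based
     *                        attempt number, returns true if that attempt succeeds
     * @param maxAttempts     plays the role of mapreduce.map.maxattempts
     * @param nodes           non-empty list of candidate nodes (assumed names)
     */
    static Result runTask(IntPredicate attemptSucceeds, int maxAttempts, List<String> nodes) {
        List<String> tried = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Round-robin: with more than one node, each retry lands on a
            // different node than the attempt immediately before it.
            String node = nodes.get((attempt - 1) % nodes.size());
            tried.add(node);
            if (attemptSucceeds.test(attempt)) {
                return new Result(true, attempt, tried);
            }
            // The attempt failed: it is recorded in 'tried' and the loop reschedules.
        }
        // Every allowed attempt failed, so the whole job would be marked as failed.
        return new Result(false, maxAttempts, tried);
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-a", "node-b", "node-c");

        // A task that fails twice and then succeeds on its third attempt.
        Result flaky = runTask(attempt -> attempt >= 3, 4, nodes);
        System.out.println("flaky task succeeded: " + flaky.succeeded
                + ", attempts: " + flaky.attemptsUsed);

        // A task that never succeeds exhausts its attempts and fails the job.
        Result broken = runTask(attempt -> false, 4, nodes);
        System.out.println("broken task succeeded: " + broken.succeeded
                + ", attempts: " + broken.attemptsUsed);
    }
}
```

The sketch keeps the attempt limit as a plain parameter so the effect of raising or lowering it is easy to see: a transient failure is absorbed by a retry on another node, while a deterministic bug in user code fails every attempt and eventually fails the job.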

How are long-running tasks handled? Tasks that don’t send a ...