Putting it Together
This lesson explains the end-to-end working of a MapReduce job.
Now that we know how MapReduce works, we can dive into the end-to-end workflow of a MapReduce job in a Hadoop cluster.
In the driver program of our example, you'll see the job is submitted using the method waitForCompletion():

job.waitForCompletion(true);

This method returns once the job has completed; its boolean return value indicates whether the job succeeded. A lot goes on behind the scenes before this method returns. We'll trace the various steps involved in the execution of a job when submitted to a Hadoop cluster.
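For context, here is a minimal, self-contained driver in the style of the canonical WordCount example. The class and job names are illustrative, but the API calls (Job.getInstance(), setMapperClass(), waitForCompletion(), and so on) are the standard org.apache.hadoop.mapreduce ones:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Emits (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Blocks until the job completes; 'true' streams progress to the
        // console. Returns true only if the job succeeded.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```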
The JobSubmitter class is responsible for talking to the resource manager and retrieving a new application ID, which is used as the ID of the MapReduce job. The class also performs sanity checks, such as verifying that the output path has been specified and doesn't already exist, and that the input splits can be successfully computed.
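As an illustration (this is not JobSubmitter's actual source), the output-path check can be reproduced with the standard FileSystem API; the path below is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputPathCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical output directory, for illustration only.
        Path outputDir = new Path("/user/demo/output");

        // Submission fails if the output directory already exists,
        // so previous results are never silently overwritten.
        if (fs.exists(outputDir)) {
            throw new IllegalStateException(
                "Output directory already exists: " + outputDir);
        }
    }
}
```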
Next, the resources for running the job are copied over to HDFS into a staging directory whose path includes the job ID. The resources include the jar file holding the mapper and reducer code to execute; this file is renamed job.jar. Configuration files and metadata about the input splits are also copied. After the job finishes successfully, the framework deletes this staging directory. You can set the property
mapreduce.task.files.preserve.filepattern
to choose which files to keep for debugging purposes. The jar file is replicated across the cluster to be readily available for node managers to access in ...
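For example, here is a sketch of setting that property programmatically, using the standard Configuration API; the class name and the regex value are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PreserveStagingFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The value is a regex matched against file names; ".*" (an
        // illustrative choice) preserves every file for post-mortem debugging.
        conf.set("mapreduce.task.files.preserve.filepattern", ".*");

        Job job = Job.getInstance(conf, "debug-run");
        // ... configure the mapper, reducer, and input/output paths as usual,
        // then submit with job.waitForCompletion(true).
    }
}
```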