Distcp
This lesson introduces distcp, Hadoop's distributed copy tool.
Distcp
The Distributed Copy tool, also known as distcp, is one of Hadoop's important tools. Commonly used in the industry for moving data around, it is also an example of a problem that MapReduce can solve. The Distcp tool copies files in parallel, either within a single Hadoop cluster or between two Hadoop clusters. It can copy files or directories. Distcp is implemented as a MapReduce job with no reduce phase. The mappers run in parallel across the cluster to perform the copy, which takes less time than copying the same data sequentially. Each file is copied by one map task; the smallest unit of work for Distcp is a file. If the number of ...
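The file-level granularity described above can be illustrated with a small toy model in Python. This is only a sketch under stated assumptions, not Distcp's actual scheduling logic: the function names (assign_files_to_mappers, copy_times), the greedy largest-first assignment, the file names and sizes, and the fixed per-mapper throughput are all hypothetical. The only behavior taken from the lesson is that a file is never split across map tasks and that mappers copy in parallel.

```python
from typing import Dict, List, Tuple


def assign_files_to_mappers(file_sizes: Dict[str, int], num_mappers: int) -> List[List[str]]:
    """Toy assignment: hand each file, largest first, to the least-loaded mapper.

    A file is always given whole to one mapper, mirroring the rule that
    the smallest unit of work is a file. (Hypothetical strategy, chosen
    only for illustration.)
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    assignments: List[List[str]] = [[] for _ in range(num_mappers)]
    loads = [0] * num_mappers
    largest_first = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in largest_first:
        target = loads.index(min(loads))  # first mapper with the smallest load
        assignments[target].append(name)
        loads[target] += size
    return assignments


def copy_times(
    file_sizes: Dict[str, int],
    assignments: List[List[str]],
    mb_per_second: float,
) -> Tuple[float, float]:
    """Return (sequential_seconds, parallel_seconds) for the toy model.

    Sequential time copies every file one after another. Parallel time is
    set by the most heavily loaded mapper, since all mappers run at once.
    """
    sequential = sum(file_sizes.values()) / mb_per_second
    parallel = max(
        sum(file_sizes[name] for name in files) for files in assignments
    ) / mb_per_second
    return sequential, parallel


# Hypothetical file sizes in MB and a hypothetical throughput of 100 MB/s per mapper.
files = {"a.log": 400, "b.log": 300, "c.log": 200, "d.log": 100}
plan = assign_files_to_mappers(files, num_mappers=2)
print(plan)                           # [['a.log', 'd.log'], ['b.log', 'c.log']]
print(copy_times(files, plan, 100))   # (10.0, 5.0)

# A single large file cannot be split, so extra mappers do not help here.
one_big = {"big.dat": 1000}
big_plan = assign_files_to_mappers(one_big, num_mappers=4)
print(copy_times(one_big, big_plan, 100))  # (10.0, 10.0)
```

In this model, four files spread over two mappers finish in half the sequential time, while a single large file takes just as long with four mappers as with one, because only one map task can work on it.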