Over time, the need to store and process big data has grown, as traditional frameworks and hardware proved insufficient to deal with the massive data surge. In 2008, Yahoo released Hadoop to the market. Hadoop is an open-source framework built for big data: it provides storage in the form of a distributed file system and equips users to process data in parallel.
Hadoop has four main modules:
Hadoop Common provides the Hadoop framework with the necessary utilities and libraries. It provides an abstraction over basic operations on the file system, and it contains the Java Archive (JAR) files and scripts needed to start Hadoop. In short, Hadoop Common supplies the shared tools the other modules rely on to work with data in the file system.
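One concrete piece of Hadoop Common is the org.apache.hadoop.conf.Configuration class, which every other module uses to read cluster settings. Here is a minimal sketch; the fs.defaultFS value is a placeholder, not an address from this article:

```java
import org.apache.hadoop.conf.Configuration;

public class CommonConfigDemo {
    public static void main(String[] args) {
        // Configuration loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Override a setting programmatically; "hdfs://namenode:9000" is a
        // hypothetical placeholder for a real NameNode address.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // Read a setting back, with a default value if it is unset.
        System.out.println(conf.get("fs.defaultFS", "file:///"));
    }
}
```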
HDFS serves as the storage layer of Hadoop. It abstracts the many servers holding data into what appears to be one giant warehouse of data, which makes storing and fetching large amounts of data extremely fluid. Before big data is stored in the distributed file system, it is split into chunks.
There are two main components to HDFS: the NameNode and the DataNodes. The NameNode serves as the master node that stores the locations/addresses of the chunks of data stored on different machines/servers. The DataNodes, by contrast, actually hold the chunked data. Data on the DataNodes is replicated across several machines in order to serve as a backup in the event of a node failure.
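To make the abstraction concrete, the sketch below uses Hadoop's org.apache.hadoop.fs.FileSystem API to write and then read a file as if the cluster were a single disk. The path /data/sample.txt is an arbitrary example; behind the scenes, the client asks the NameNode for block locations and streams the bytes to and from DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // FileSystem.get() returns a client for whatever fs.defaultFS points at
        // (HDFS on a real cluster, the local file system otherwise).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/data/sample.txt"); // example path, not from the article

        // Write: the NameNode chooses DataNodes for each block,
        // then the client streams the bytes to them.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from HDFS");
        }

        // Read: the NameNode supplies block locations; the data itself
        // comes from the DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}
```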
The MapReduce module operates with two kinds of workers: Map and Reduce. Above the Map and Reduce workers sits a master node that schedules tasks onto them. Map workers fetch raw data from the file system, organize it systematically into partitions, and pass it into a buffer. Reduce workers take the mapped data from the buffer and aggregate it according to the format given by the master node. By the end of the procedure, the data is organized into an easily readable and manageable form.
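The canonical illustration of this flow is word counting: each Mapper emits a (word, 1) pair for every token it reads, and each Reducer sums the counts for one word. The sketch below follows the standard Hadoop word-count pattern using the org.apache.hadoop.mapreduce API; input and output paths are supplied on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: read raw lines, emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: aggregate all counts emitted for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```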
YARN is Hadoop’s resource manager and job scheduler. It was introduced in Hadoop 2.0 to separate the resource-management layer from the processing layer. YARN is built around three main components: the Client, the Resource Manager, and the Node Manager: