What is Apache Hadoop?

Overview

Over time, the need to store and process big data has grown, as traditional frameworks and hardware proved insufficient to handle the massive surge in data. In 2008, Yahoo released Hadoop to the market. Hadoop is an open-source framework that is a powerhouse for dealing with big data: it provides storage in the form of a distributed file system and equips users to process data in parallel.

Hadoop has four main modules:

  1. Hadoop Common
  2. Hadoop Distributed File System (HDFS)
  3. Hadoop MapReduce
  4. Hadoop Yet Another Resource Negotiator (YARN)

1. Hadoop Common

Hadoop Common provides the Hadoop framework with the necessary utilities and libraries. It provides an abstraction over basic operations on the filesystem and contains the Java Archive (JAR) files and scripts needed to start Hadoop. Hadoop Common primarily supplies the tools to process data stored in the file system.
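
A minimal sketch of that filesystem abstraction is shown below. The path and file name are hypothetical; the classes come from the org.apache.hadoop.conf and org.apache.hadoop.fs packages shipped with the Hadoop client libraries, and the same calls work whether fs.defaultFS points at the local disk or at HDFS.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsCommonSketch {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml from the classpath; falls back to the local filesystem.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; the code is identical for local files and HDFS files.
            Path file = new Path("/tmp/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hadoop");
            }
            System.out.println("exists: " + fs.exists(file));
        }
    }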

2. Hadoop Distributed File System (HDFS)

HDFS serves as the storage hub for Hadoop. It abstracts numerous servers storing data into one giant warehouse of data, which makes storing and fetching large amounts of data extremely fluid. Big data is chunked into blocks before it is stored in the distributed file system; for example, with the default block size of 128 MB, a 1 GB file is split into eight blocks that can be spread across different machines.

There are two main components to HDFS: the NameNode and the DataNodes. The NameNode serves as the master node and stores the locations/addresses of the chunks of data kept on different machines/servers. The DataNodes, on the other hand, hold the data chunks themselves. The data on DataNodes is replicated across several machines so that a backup exists in the event of a node failure.
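
The split between NameNode metadata and DataNode storage is visible from the client API. The sketch below, using a hypothetical file path, asks the NameNode for the block layout of a file and prints which DataNodes hold each replica; getFileBlockLocations is part of the standard org.apache.hadoop.fs.FileSystem API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/big.log")); // hypothetical file

            // For each block of the file, the NameNode reports the DataNodes holding a replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " length " + block.getLength()
                        + " hosts " + String.join(",", block.getHosts()));
            }
        }
    }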

3. Hadoop MapReduce

The MapReduce module operates with two main types of workers: Map workers and Reduce workers. Above them sits a master node that schedules tasks on the workers. Map workers fetch raw data from the file system, organize it into partitions, and write it to a buffer. Reduce workers take the mapped data from the buffer and aggregate it according to the format specified by the master node. By the end of the procedure, the data is organized into an easily readable and manageable form.
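
The canonical illustration of this division of labor is word counting: Map workers emit a (word, 1) pair for every token they read, and Reduce workers sum the counts per word. The sketch below uses the org.apache.hadoop.mapreduce API; class names and input/output paths are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: read raw lines and emit a (word, 1) pair per token.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws java.io.IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: aggregate the counts emitted for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws java.io.IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }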

4. Hadoop Yet Another Resource Negotiator (YARN)

YARN is Hadoop’s resource manager and job scheduler. It was introduced in Hadoop 2.0 to separate the resource-management layer from the processing layer. YARN is built around three main components: the Client, the Resource Manager, and the Node Manager.

  1. The Client submits jobs to the Resource Manager.
  2. The Resource Manager is YARN’s master daemon and governs all activity: it schedules tasks, allocates resources to the Node Managers, and monitors the health of running applications, restarting them if the need arises.
  3. The Node Manager takes care of an individual node in the YARN architecture: it monitors the node’s health, manages its resources, logs the necessary metrics, and reports back to the Resource Manager, as sketched in the example below.
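
The sketch below shows the client side of this architecture: a YarnClient connects to the Resource Manager, lists the Node Managers that have registered with it, and lists the applications it is tracking. The classes come from the org.apache.hadoop.yarn.client.api and org.apache.hadoop.yarn.api.records packages; the Resource Manager address is assumed to come from a yarn-site.xml on the classpath.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnInfoSketch {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration()); // reads yarn-site.xml for the Resource Manager address
            yarn.start();

            // Node Managers currently registered with the Resource Manager.
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId() + " capacity " + node.getCapability());
            }

            // Applications the Resource Manager is tracking.
            List<ApplicationReport> apps = yarn.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getApplicationId() + " state " + app.getYarnApplicationState());
            }

            yarn.stop();
        }
    }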