Hadoop ecosystem#
The Hadoop ecosystem is a suite of services we can use to work on big data initiatives. Its four main elements are:
- MapReduce
- Hadoop Distributed File System (HDFS)
- Yet Another Resource Negotiator (YARN)
- Hadoop Common
Let’s take a closer look at each of these services.
MapReduce#
Hadoop MapReduce is a programming model for distributed computing. With it, we can process large amounts of data in parallel on large clusters of commodity hardware. The model has two phases: Map and Reduce. Map converts a set of input data into tuples (key/value pairs), and Reduce takes the output of Map as its input and combines those tuples into a smaller set of tuples. MapReduce makes it easy to scale data processing to run on tens of thousands of machines in a cluster.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. When the tasks are complete, the cluster collects and reduces the data into a result and sends it back to the Hadoop server.
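To make the two phases concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API (the class and field names are illustrative). The mapper turns each line of text into (word, 1) tuples, and the reducer combines those tuples into one (word, total) tuple per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: split each input line into words and emit a (word, 1) tuple per word.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: combine the (word, 1) tuples emitted by the mappers
  // into a single (word, total count) tuple for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```

In a real job, a driver class would submit these mapper and reducer classes to the cluster, and Hadoop would distribute the Map and Reduce tasks across the nodes as described above.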
Hadoop Distributed File System (HDFS)#
As the name suggests, HDFS is a distributed file system. It handles large sets of data and runs on commodity hardware. HDFS lets us scale a single Hadoop cluster across many nodes and perform parallel processing. Its built-in servers, the NameNode and DataNodes, also let us check the status of the cluster. HDFS is designed to be highly fault-tolerant, portable, and cost-effective.
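As a rough illustration, the sketch below uses Hadoop's Java FileSystem API to copy a local file into HDFS and list a directory. The file paths are hypothetical, and the NameNode address is assumed to come from the cluster's core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Configuration picks up fs.defaultFS (the NameNode address) from
    // core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; the NameNode tracks its metadata and the
    // blocks are stored and replicated across DataNodes. Paths are examples.
    fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));

    // List what is stored under /data in the distributed file system.
    for (FileStatus status : fs.listStatus(new Path("/data"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    fs.close();
  }
}
```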
Yet Another Resource Negotiator (YARN)#
Hadoop YARN is a cluster resource management and job scheduling tool. YARN also works with the data we store in HDFS, allowing us to perform tasks such as:
- Graph processing
- Interactive processing
- Stream processing
- Batch processing
It dynamically allocates resources and schedules application processing. YARN supports MapReduce along with multiple other processing models. It utilizes cluster resources efficiently and is backward compatible, meaning that MapReduce jobs written for earlier versions of Hadoop run on it without any issues.
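As a small sketch of how an application can talk to YARN, the snippet below uses the YarnClient API to ask the ResourceManager for the applications currently running on the cluster. The setup is illustrative and assumes a reachable ResourceManager configured in yarn-site.xml.

```java
import java.util.EnumSet;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListRunningApps {
  public static void main(String[] args) throws Exception {
    // YarnClient talks to the ResourceManager defined in yarn-site.xml.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();

    // Ask the ResourceManager for every application currently running,
    // whatever processing model (MapReduce, streaming, etc.) it uses.
    List<ApplicationReport> apps =
        yarnClient.getApplications(EnumSet.of(YarnApplicationState.RUNNING));
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + "  "
          + app.getName() + "  " + app.getApplicationType());
    }

    yarnClient.stop();
  }
}
```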
Hadoop Common#
Hadoop Common, also known as Hadoop Core, provides the shared Java libraries and utilities that all of the other Hadoop modules rely on.
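For example, the Configuration class used in the sketches above comes from Hadoop Common and is passed to HDFS, YARN, and MapReduce alike. A minimal illustration:

```java
import org.apache.hadoop.conf.Configuration;

public class ShowDefaultFs {
  public static void main(String[] args) {
    // Configuration, part of Hadoop Common, loads core-default.xml and
    // core-site.xml and is shared by the other Hadoop modules.
    Configuration conf = new Configuration();

    // fs.defaultFS names the default file system (e.g., the HDFS NameNode);
    // "file:///" is just a fallback if no cluster configuration is found.
    System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
  }
}
```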