AWS EMR
Learn about the architecture of Amazon EMR and how it helps in data processing.
Amazon EMR (previously called Amazon Elastic MapReduce) is a cloud-based service offered by Amazon Web Services (AWS) that helps us process and analyze large amounts of data. It simplifies running big data frameworks like Hadoop and Spark on AWS. Because it's a managed service, it removes much of the complexity of managing big data infrastructure: processing capacity can be scaled to match the data volume, and we only pay for what we use. In this lesson, we will learn about the features of EMR and how it works.
Amazon EMR cluster
The central component of Amazon EMR is the cluster. It’s a group of Amazon EC2 instances working together as a single compute resource. Each instance is called a node. Nodes are categorized into types according to the roles they perform, and those roles depend on the software components that Amazon EMR installs on them.
The different types of nodes are as follows:
Primary node: The primary node in an Amazon EMR cluster has a software component that manages the overall coordination and execution of tasks in the cluster. It coordinates the distribution of tasks across core and task nodes, monitors the health of the cluster, and manages communication between nodes.
Core node: Core nodes have software components that let them store and process data. They typically run the Hadoop Distributed File System (HDFS) and execute data processing, storage, and retrieval tasks. Core nodes store data blocks and perform data replication for fault tolerance.
Task node: Task nodes are additional compute resources used for processing tasks in parallel. Unlike core nodes, their software components do not let them store data in HDFS; they only execute tasks assigned by the primary node. These nodes are optional and are often added to increase processing capacity without increasing storage capacity.
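To make the three node types concrete, here is a minimal Python sketch that builds the kind of instance group list a cluster definition might contain. The helper name build_instance_groups, the group names, and the m5.xlarge instance type are assumptions chosen for illustration and are not part of this lesson. The role strings follow the EMR API, which still uses MASTER for the primary node. The code only builds a data structure and does not call AWS.

```python
def build_instance_groups(core_count, task_count=0, instance_type="m5.xlarge"):
    """Describe the primary, core, and optional task node groups of a cluster.

    Hypothetical helper for illustration; instance_type is a placeholder.
    """
    if core_count < 1:
        raise ValueError("this sketch expects at least one core node")
    if task_count < 0:
        raise ValueError("task_count cannot be negative")

    groups = [
        {
            # One primary node coordinates the cluster.
            # The EMR API names this role MASTER.
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            # Core nodes store data in HDFS and process it.
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]

    # Task nodes are optional: they add compute capacity but no HDFS storage.
    if task_count > 0:
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=3):
        print(group["InstanceRole"], group["InstanceCount"])
```

A list like this could be supplied as the InstanceGroups entry of the Instances parameter when creating a cluster with boto3's run_job_flow call, alongside the other settings that call requires (such as a release label and IAM roles), which are omitted here.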
Amazon EMR architecture
The Amazon EMR service architecture is divided into multiple layers, each providing the cluster with specific functionalities and ...