High-level Design of MapReduce
Let's get an overview of the MapReduce system's design.
Let’s start designing our system at a high level. This lesson gives an overview of the components and how they work together to meet the functional and non-functional requirements listed in the previous lesson.
High-level components
Let’s list the main components we need to design the MapReduce system.
- Distributed file system: We use the Google File System (GFS) as our distributed file system for storing the input data. The detailed design lesson of this chapter explains how GFS functions in relation to our system.
- Cluster: We need a cluster of machines to process the data in parallel.
- User program: We need the user program, mainly the Map and Reduce functions, to run on all the workers for data processing. A worker is a commodity machine in the cluster that can carry out the system’s functionality independently. It gets its share of the work through a master.
- Scheduler: Before the MapReduce operation starts, the user program is installed on all the workers so that each can perform the task assigned to it. We need a scheduler to manage the assignment of jobs to the workers. Its main purpose is to optimize the use of workers in the cluster.
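To make the scheduler's role concrete, here is a minimal sketch of job assignment. It assumes a simple first-come, first-served policy; the function name `assign_tasks` and the task and worker labels are hypothetical, chosen only for illustration. A real scheduler would also weigh factors such as data locality and worker failures.

```python
from collections import deque


def assign_tasks(tasks, idle_workers):
    """Pair pending tasks with idle workers in arrival order.

    Assumed policy: first come, first served. Returns the assignments
    made in this round and the tasks still waiting for a free worker.
    """
    pending = deque(tasks)
    idle = deque(idle_workers)
    assignments = {}
    # Hand out one task per idle worker until either queue runs dry.
    while pending and idle:
        assignments[idle.popleft()] = pending.popleft()
    return assignments, list(pending)


# Three map tasks but only two idle workers: one task has to wait.
assignments, waiting = assign_tasks(
    ["map-0", "map-1", "map-2"], ["worker-a", "worker-b"]
)
print(assignments)
print(waiting)
```

In this sketch, the leftover tasks would be handed out in a later round, once a worker reports that it is idle again.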
High-level implementation
Our system works on the following scheme:
- We divide the input data into a specific number of splits. Each split is processed individually on a worker using the Map function, producing its own intermediate key-value pairs.
- We distribute these intermediate key-value pairs among the next set of workers, which perform the Reduce operation, using a hash function.
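A common choice for the partitioning step, and the default in the original MapReduce paper, is hash(key) mod R, where R is the number of reduce workers. The sketch below assumes that choice and runs the whole scheme in a single process on a word-count example. The names `map_fn`, `reduce_fn`, `partition`, and `run_mapreduce` are hypothetical, and CRC32 stands in for the hash so that results stay the same across runs.

```python
import zlib
from collections import defaultdict


def map_fn(split):
    """Hypothetical user Map function: emit (word, 1) for each word."""
    return [(word, 1) for word in split.split()]


def reduce_fn(key, values):
    """Hypothetical user Reduce function: sum the counts for one key."""
    return key, sum(values)


def partition(key, num_reducers):
    """Assumed partitioning: hash(key) mod R, with CRC32 as a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapreduce(splits, num_reducers):
    # Map phase: each split would go to its own worker; here we loop in turn.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_fn(split):
            # All pairs with the same key land in the same reducer bucket.
            buckets[partition(key, num_reducers)][key].append(value)
    # Reduce phase: each bucket stands in for one reduce worker.
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            out_key, out_value = reduce_fn(key, values)
            result[out_key] = out_value
    return result


splits = ["the quick brown fox", "the lazy dog", "the fox"]
counts = run_mapreduce(splits, num_reducers=3)
print(counts["the"], counts["fox"])  # prints "3 2"
```

Because the partition depends only on the key, every occurrence of a word reaches the same reduce worker, which is what lets each worker produce a complete count for its keys.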