High-level Design of MapReduce

Let’s start designing our system at a high level. This lesson gives an overview of the components and how they work together to achieve the functional and non-functional requirements mentioned in the previous lesson.

High-level components

Let’s list the main components we need to design the MapReduce system.

  • Distributed file system: We’re using the Google File System (GFS) as our distributed file system for storing the input data. We’ll explain its functionality, as it relates to our system, in the detailed design lesson of this chapter.
  • Cluster: We need a cluster of machines to process the data in parallel.
  • User program: We need the user program, mainly the Map and Reduce functions, to run on all the workers for data processing. A worker is a commodity machine inside the cluster that can carry out the system’s functionality independently; it gets its share of the work from a master. A minimal sketch of these user-defined functions appears after this list.
  • Scheduler: Before the MapReduce operation starts, the user program is installed on all the workers so that each worker can perform the task assigned to it. We need a scheduler to manage job assignment across the workers; its main purpose is to optimize worker utilization in the cluster.
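
For a concrete sense of what the user program looks like, here is a minimal Python sketch of user-defined Map and Reduce functions for a hypothetical word-count job. The function names and signatures are illustrative assumptions, not the exact interface of any particular MapReduce library:

```python
# Hypothetical user-defined functions for a word-count MapReduce job.

def map_fn(filename, contents):
    """Map: emit an intermediate (word, 1) pair for every word in the input split."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for the same word."""
    yield (word, sum(counts))
```

The same pair of functions is shipped to every worker: Map workers apply `map_fn` to their input splits, and Reduce workers apply `reduce_fn` to the grouped intermediate values for each key.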

High-level implementation

Our system works on the following scheme:

  1. We divide the input data into a fixed number of splits. Each split is processed independently on a worker by the Map function, producing its own set of intermediate key-value pairs.
  2. We distribute these intermediate key-value pairs to the next type of workers (those performing the Reduce operation) using a hash function on the intermediate key, as sketched below.
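
A common partitioning choice, following the original MapReduce paper, is hash(key) mod R, where R is the number of Reduce workers, so that every pair with the same key lands on the same Reduce worker. Below is a minimal Python sketch of this partitioning; the worker count R and the use of CRC32 as a stable hash are assumptions made for illustration:

```python
import zlib
from collections import defaultdict

R = 4  # assumed number of Reduce workers (partitions)

def partition(key, num_reducers=R):
    """Assign an intermediate key to one of R Reduce partitions.

    A stable hash (CRC32) is used so that every Map worker sends
    the same key to the same Reduce worker.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Group intermediate key-value pairs by their destination Reduce worker.
intermediate_pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
buckets = defaultdict(list)
for key, value in intermediate_pairs:
    buckets[partition(key)].append((key, value))
```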
