High-level Design of MapReduce

Let's get an overview of the MapReduce system's design.


Let’s start designing our system at a high level. This lesson gives an overview of the components and how they work together to meet the functional and non-functional requirements described in the previous lesson.

High-level components

Let’s list the main components we need to design the MapReduce system.

Distributed file system: We use the Google File System (GFS) as the distributed file system that stores the input data. The detailed design lesson of this chapter explains how GFS works in relation to our system.
Cluster: We need a cluster of machines to process the data in parallel.

User program: We need the user program, mainly the Map and Reduce functions, to run on all the workers for data processing. A worker is one commodity machine inside a cluster, capable of achieving the system’s functionality independently. It gets its share of the work through a master.
Scheduler: Before the MapReduce operation starts, the user program is installed on all the workers so that each one can perform the task assigned to it. We need a scheduler to manage the assignment of jobs to the workers. Its main purpose is to optimize the workers’ usage in the cluster.
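
To make the scheduler's role concrete, here is a minimal sketch of handing tasks to idle workers. It is an illustration under simplifying assumptions, not the scheduling policy of the actual system: the `Worker` class, the `schedule` function, and the task names are all hypothetical, and every task is assumed to take the same amount of time.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    """A hypothetical worker: one machine in the cluster."""
    name: str
    completed: list = field(default_factory=list)


def schedule(tasks, workers):
    """Hand each pending task to the worker that has been idle longest.

    Assumption: all tasks take equal time, so a worker is idle again
    right after finishing and idleness can be modeled with a queue.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    idle = deque(workers)        # workers waiting for work
    pending = deque(tasks)       # tasks not yet assigned
    assignment = {}              # task name -> worker name
    while pending:
        worker = idle.popleft()  # longest-idle worker goes first
        task = pending.popleft()
        assignment[task] = worker.name
        worker.completed.append(task)
        idle.append(worker)      # worker rejoins the idle queue
    return assignment


workers = [Worker("worker-0"), Worker("worker-1"), Worker("worker-2")]
tasks = [f"map-{i}" for i in range(5)]
assignment = schedule(tasks, workers)
for task, worker_name in assignment.items():
    print(task, "->", worker_name)
```

With equal-length tasks this reduces to a round-robin assignment. A real scheduler would also have to account for factors such as uneven task durations and failed workers, which this sketch ignores.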

High-level implementation

Our system works on the following scheme:

We divide the input data into a specific number of splits. Each split is processed individually on a worker using a Map function, which produces intermediate key-value pairs for that split.
We

...
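
The first step of the scheme, splitting the input and applying a Map function to each split, can be sketched as follows. This is a minimal single-process illustration, assuming a word-count style Map function; `split_input` and `map_fn` are hypothetical names, and in the real system each split would be processed on a separate worker rather than in a loop.

```python
def split_input(lines, num_splits):
    """Divide the input lines into roughly equal splits (hypothetical helper)."""
    size = max(1, -(-len(lines) // num_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_fn(split):
    """Example Map function: emit a (word, 1) pair for every word in the split."""
    pairs = []
    for line in split:
        for word in line.split():
            pairs.append((word.lower(), 1))
    return pairs


lines = ["the quick brown fox", "the lazy dog", "the end"]
splits = split_input(lines, num_splits=2)

# Each split yields its own list of intermediate key-value pairs.
intermediate = [map_fn(split) for split in splits]
for index, pairs in enumerate(intermediate):
    print(f"split {index}: {pairs}")
```

Each split is mapped independently of the others, which is what allows the work to be spread across workers in parallel.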
