System Design: MapReduce

Let's introduce a MapReduce system for processing big data on multiple machines.

Parallel data processing—the domain of a Jedi programmer!

Parallel computing is known to be difficult and full of potential minefields, such as Heisenbugs (a software bug that disappears or changes its behavior when one attempts to investigate or resolve it). With the rise of many-core servers and distributed computing, parallel computing has become too useful for regular programmers to ignore or to leave to Jedi programmers; they need it to speed up their applications.

Google introduced a new programming model (MapReduce) that enabled programmers (of any expertise level) to express their data processing needs as if they were writing sequential code. The MapReduce runtime automatically takes care of the messy details of distributing data and running the jobs in parallel on multiple servers, even under many fault conditions. The widespread use of the MapReduce model proves its applicability to a broad range of data processing problems.
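To make the model concrete, here is a minimal single-process sketch of the MapReduce programming style, using the classic word-count workload. The function names (`map_fn`, `reduce_fn`, `run_mapreduce`) and the in-memory "shuffle" are illustrative assumptions, not the real distributed runtime; they only show how a programmer expresses the computation as sequential-looking code.

```python
from collections import defaultdict

# Hypothetical user-defined functions in the MapReduce style.
def map_fn(doc_id, text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Combine all partial counts emitted for one word.
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase, then an in-memory "shuffle" that groups values by key.
    groups = defaultdict(list)
    for key, value in inputs.items():
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Reduce phase: one call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = {"d1": "the quick fox", "d2": "the lazy dog"}
result = run_mapreduce(docs, map_fn, reduce_fn)
# "the" appears in both documents, so result["the"] == 2
```

In the real system, the same two user functions are all the programmer writes; the runtime partitions the input, runs `map_fn` on many machines, shuffles intermediate pairs across the network, and runs `reduce_fn` in parallel.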

In this chapter, we will study the design of the MapReduce system in detail and the application of its programming model.

Motivation

Information extracted from different kinds of data sources plays an increasingly integral part in our society. Examples include collecting, summarizing, and indexing data from the World Wide Web for efficient searching, and detecting anomalies in an organization's outbound network data.

Not only is the volume of each data source increasing, but a growing variety of new data sources is also becoming available. Recent trends show that the rate of data growth far exceeds the growth of the computational resources needed to transform raw data into actionable information.

[Figure: trendlines of processing power and data growth over time]
