Evaluation of Spark

Let's evaluate how Spark fulfills its promised functionalities.

Spark can be used efficiently for many data processing use cases. Because Spark processes data in memory, it should provide low latency. Spark also promises fault tolerance, data locality, persistent in-memory data, and memory management. Let's discuss how well Spark provides these functionalities with the following arguments.

Note: All computational results and timings stated below are taken from the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." The experiments were run on 100 GB of data using approximately 25 to 100 machines (depending on the experiment), each with 4 cores and 15 GB of RAM.

Latency

We see the speed-up Spark achieves when we run an algorithm that requires more than one iteration, for example, K-means or logistic regression. If we perform the same task with Hadoop (an open-source implementation of the MapReduce framework), it runs slower, even if we use HadoopBinMem (HadoopBM), which converts the data into a binary format and stores it in a replicated instance of in-memory HDFS. This is for the following reasons.

Overheads: The first overhead that makes Hadoop slower than Spark is the signaling overhead of Hadoop's heartbeat protocol between manager and worker nodes: the driver periodically signals each worker to check that it is healthy, and a worker that does not respond is considered failed. Running no-operation tasks on Hadoop shows a minimum overhead of 25 seconds due to job setup, task startup, and cleanup, and HDFS adds further overhead while serving each block of data.

Deserialization cost: Hadoop also spends time processing text and converting binary records into Java objects usable in memory. This overhead occurs in all cases, whether the data lies in an in-memory HDFS instance on the local machine or in a local in-memory file.

Spark stores RDD elements as Java objects directly in memory to avoid all these overheads.
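As an illustration, here is a minimal sketch of that behavior (the application name, file path, and variable names are assumptions for illustration, not taken from the paper): persisting an RDD with the default MEMORY_ONLY storage level keeps its partitions as deserialized Java objects, so subsequent actions reuse them without re-parsing the input.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object InMemorySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InMemorySketch"))

    // Parse the input text once; the path is illustrative.
    val points = sc.textFile("data/points.txt")
      .map(line => line.split(' ').map(_.toDouble))

    // MEMORY_ONLY (also what cache() uses) stores the partitions as
    // deserialized Java objects in memory, avoiding the repeated
    // text-parsing and deserialization costs described above.
    points.persist(StorageLevel.MEMORY_ONLY)

    println(points.count()) // First action reads the file and fills the cache.
    println(points.count()) // Second action is served from in-memory objects.
    sc.stop()
  }
}
```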

For the K-means algorithm run for 10 iterations on 100 machines, Spark completes the first iteration in 82 seconds, Hadoop in 115 seconds, and HadoopBinMem in 182 seconds. Hadoop is slower than Spark largely because of its heartbeat protocol overhead. HadoopBinMem is the slowest because it must run an additional MapReduce job to convert the data into binary format and write it to an instance of in-memory HDFS.
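To see why caching matters so much for these iterative workloads, here is a hedged sketch of the pattern the experiments measure, in the spirit of the logistic regression example from the RDD paper (this is illustrative code, not the benchmark itself; the input path, feature count, step size, and names are assumptions). The points are parsed and cached once, and every iteration reuses the in-memory objects instead of re-reading and re-deserializing the data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LRSketch"))
    val numFeatures = 10
    val iterations  = 10
    val stepSize    = 0.1

    // Each input line: label followed by feature values, space-separated.
    // Parsed once and cached; later iterations skip parsing entirely.
    val points = sc.textFile("data/points.txt").map { line =>
      val nums = line.split(' ').map(_.toDouble)
      Point(nums.tail, nums.head)
    }.cache()

    var w = Array.fill(numFeatures)(0.0)

    for (_ <- 1 to iterations) {
      // Each pass computes the gradient over the cached points;
      // only this computation runs per iteration, not the input parsing.
      val gradient = points.map { p =>
        val dot   = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })

      w = w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi }
    }

    println(s"Final weights: ${w.mkString(", ")}")
    sc.stop()
  }
}
```

In a Hadoop-based version of this loop, every iteration would be a separate MapReduce job that re-reads its input (from disk or in-memory HDFS) and re-deserializes it into Java objects, which is exactly the per-iteration overhead the numbers above reflect.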
