Evaluation of Spark

Let's evaluate how Spark fulfills its promised functionalities.

Spark can be used efficiently for many data processing use cases. Because Spark performs data processing in memory, it should provide low latency. Other functionalities that Spark promises include fault tolerance, data locality, persistent in-memory data, and memory management. Let's discuss how well Spark delivers each of these functionalities with the following arguments.

Note: All the computational results and running times stated in the text below are taken from the paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. The experiments were run on 100 GB of data using approximately 25 to 100 machines (depending on the experiment), each with 4 cores and 15 GB of RAM.

Latency

When we use Spark to run an algorithm that requires more than one iteration, for example, the k-means algorithm or logistic regression, the speed-up Spark achieves becomes clear. If we perform the same task with Hadoop (an open-source implementation of the MapReduce framework), it runs slower, even if we use HadoopBinMem (HadoopBM), which converts the data into a binary format and stores it in a replicated instance of in-memory HDFS, for the following reasons.
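To make the iterative workload concrete, here is a minimal sketch of logistic regression by gradient descent, the same per-iteration map-and-reduce pattern used as a benchmark in the RDD paper. This is plain local Python for illustration, not Spark code; the dataset, step size, and iteration count are made up. In Spark, `points` would be an RDD loaded once and kept in memory with `cache()`, so every iteration after the first reads from RAM instead of re-reading (and re-parsing) the input from disk, which is where the speed-up over Hadoop comes from.

```python
import math

# Toy dataset: (features, label) pairs with labels in {-1, +1}.
# In Spark, this would be a cached RDD reused across iterations.
points = [
    ((1.0, 2.0), 1.0),
    ((2.0, 1.5), 1.0),
    ((-1.0, -1.0), -1.0),
    ((-2.0, -0.5), -1.0),
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gradient(w, x, y):
    # Per-point gradient of the logistic loss -- this is the "map"
    # step that Spark would run over the in-memory partitions.
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return tuple(scale * xi for xi in x)

def train(points, iterations=100, step=0.1):
    w = (0.0, 0.0)
    for _ in range(iterations):
        # map: per-point gradients; reduce: element-wise sum.
        grads = [gradient(w, x, y) for x, y in points]
        total = tuple(sum(g[i] for g in grads) for i in range(len(w)))
        w = tuple(wi - step * gi for wi, gi in zip(w, total))
    return w

w = train(points)
print(w)
```

Only the small weight vector `w` changes between iterations; the full dataset is read-only, which is exactly the access pattern that makes keeping it in memory across iterations so effective.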

Overheads: The first overhead that makes Hadoop slower than Spark is the signaling overhead due to Hadoop's heartbeat protocol ...
