Evaluation of Spark

Let's evaluate how Spark fulfills its promised functionalities.

Spark can be used efficiently for many data processing use cases. Because Spark processes data in memory, it should provide low latency. Spark also promises fault tolerance, data locality, persistent in-memory data, and memory management. Let's discuss how well Spark provides these functionalities with the following arguments.

Note: All computational results and timings stated below are taken from the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." The experiments were run on 100 GB of data using approximately 25 to 100 machines (depending on the experiment), each with 4 cores and 15 GB of RAM.

Latency

We see the speed-up Spark achieves when we run an algorithm that requires more than one iteration, for example, K-means or logistic regression. If we perform the same task with Hadoop (an open-source implementation of the MapReduce framework), it runs slower, even if we use HadoopBinMem (HadoopBM), which converts the data into a binary format and stores it in a replicated instance of in-memory HDFS. This is for the following reasons.

Overheads: The first overhead that makes Hadoop slower than Spark is the signaling overhead of Hadoop's heartbeat protocol between manager and worker nodes: the driver periodically signals each worker to check that it is healthy, and a worker that does not respond is considered failed. Running no-operation tasks on Hadoop shows a minimum overhead of 25 seconds due to job setup, task startup, and cleanup, and HDFS adds further overhead while serving each block of data.

Deserialization cost: Hadoop also spends time processing text and converting binary records into Java objects usable in memory. This overhead occurs in all cases, whether the data lies in an in-memory HDFS instance on the local machine or in a local in-memory file.

Spark stores RDD elements as Java objects directly in memory to avoid all these overheads.
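As an illustration, here is a minimal sketch of that behavior (the application name, file path, and variable names are assumptions for illustration, not taken from the paper): persisting an RDD with the default MEMORY_ONLY storage level keeps its partitions as deserialized Java objects, so subsequent actions reuse them without re-parsing the input.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object InMemorySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InMemorySketch"))

    // Parse the input text once; the path is illustrative.
    val points = sc.textFile("data/points.txt")
      .map(line => line.split(' ').map(_.toDouble))

    // MEMORY_ONLY (also what cache() uses) stores the partitions as
    // deserialized Java objects in memory, avoiding the repeated
    // text-parsing and deserialization costs described above.
    points.persist(StorageLevel.MEMORY_ONLY)

    println(points.count()) // First action reads the file and fills the cache.
    println(points.count()) // Second action is served from in-memory objects.
    sc.stop()
  }
}
```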

For the K-means algorithm run for 10 iterations on 100 machines, Spark completes the first iteration in 82 seconds, Hadoop in 115 seconds, and HadoopBinMem in 182 seconds. Hadoop is slower than Spark largely because of its heartbeat protocol overhead. HadoopBinMem is the slowest because it must run an additional MapReduce job to convert the data into binary format and write it to an instance of in-memory HDFS.
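To see why caching matters so much for these iterative workloads, here is a hedged sketch of the pattern the experiments measure, in the spirit of the logistic regression example from the RDD paper (this is illustrative code, not the benchmark itself; the input path, feature count, step size, and names are assumptions). The points are parsed and cached once, and every iteration reuses the in-memory objects instead of re-reading and re-deserializing the data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LRSketch"))
    val numFeatures = 10
    val iterations  = 10
    val stepSize    = 0.1

    // Each input line: label followed by feature values, space-separated.
    // Parsed once and cached; later iterations skip parsing entirely.
    val points = sc.textFile("data/points.txt").map { line =>
      val nums = line.split(' ').map(_.toDouble)
      Point(nums.tail, nums.head)
    }.cache()

    var w = Array.fill(numFeatures)(0.0)

    for (_ <- 1 to iterations) {
      // Each pass computes the gradient over the cached points;
      // only this computation runs per iteration, not the input parsing.
      val gradient = points.map { p =>
        val dot   = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })

      w = w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi }
    }

    println(s"Final weights: ${w.mkString(", ")}")
    sc.stop()
  }
}
```

In a Hadoop-based version of this loop, every iteration would be a separate MapReduce job that re-reads its input (from disk or in-memory HDFS) and re-deserializes it into Java objects, which is exactly the per-iteration overhead the numbers above reflect.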
