Introduction to Spark
Learn about a new framework for efficient, iterative, and interactive data processing.
The original MapReduce system set the stage for processing large volumes of data efficiently: adding more worker machines produced faster results. The MapReduce runtime automatically managed the cumbersome details of distributed and parallel computing, and its programming interface was simple to use. However, MapReduce is primarily a batch-oriented data processing system, so all the data to be processed must be available at the start of a job. As a result, many data processing needs are not met by the MapReduce model. Two prominent examples are as follows:
- Iterative data processing, where we use the same data repeatedly until we converge on some goal
- Interactive, ad hoc queries on the same set of data
One can argue that both of the scenarios above can still be handled with the MapReduce framework by running Map and Reduce tasks repeatedly. The problem is that the latency to get a result will be fairly high, and far from real time, because each new iteration of the MapReduce job reads its input from the replicated persistent store, which is where the previous iteration just wrote its output. We need a new processing framework without these inefficiencies of the MapReduce model.
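To make the cost of repeated iterations concrete, here is a minimal Python sketch. It is not MapReduce or Spark code. A local temporary file stands in for the replicated persistent store, and every name in it (map_phase, reduce_phase, run_mapreduce_style, run_in_memory) is hypothetical and chosen only for illustration. The "job" simply halves every value until the largest one drops below a threshold. The first version re-reads its input from storage and writes its output back on every iteration, while the second keeps the working set in memory between iterations.

```python
import json
import os
import tempfile


def map_phase(records):
    # Map step: halve every value (a stand-in for one refinement step).
    return [value / 2 for value in records]


def reduce_phase(mapped):
    # Reduce step: one aggregate value used for the convergence check.
    return max(mapped)


def run_mapreduce_style(values, threshold):
    """Run each iteration as a separate job that reads from and writes to storage."""
    reads = 0
    writes = 0
    with tempfile.TemporaryDirectory() as store:
        # Assumption: a local file plays the role of the persistent store.
        path = os.path.join(store, "data.json")
        with open(path, "w") as f:
            json.dump(values, f)
        writes += 1
        while True:
            # Every iteration starts by re-reading what the last one wrote.
            with open(path) as f:
                records = json.load(f)
            reads += 1
            mapped = map_phase(records)
            largest = reduce_phase(mapped)
            # Output goes back to storage so the next job can pick it up.
            with open(path, "w") as f:
                json.dump(mapped, f)
            writes += 1
            if largest < threshold:
                return mapped, reads, writes


def run_in_memory(values, threshold):
    """Run the same loop, keeping the data in memory between iterations."""
    records = list(values)
    iterations = 0
    while True:
        records = map_phase(records)
        iterations += 1
        if reduce_phase(records) < threshold:
            return records, iterations
```

Calling run_mapreduce_style([8.0, 4.0, 2.0], 1.0) and run_in_memory([8.0, 4.0, 2.0], 1.0) yields the same final values, but the first version performs one storage read and one storage write per iteration, plus the initial write. On a real cluster each of those storage operations also involves replication and network transfer, which is the latency the paragraph above describes.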