Introduction to Apache Spark
Let's take an introductory look at Apache Spark and its architecture.
Apache Spark is a data processing system that was initially developed at the University of California, Berkeley's AMPLab, and later donated to the Apache Software Foundation.
Note that Apache Spark was developed in response to some of the limitations of MapReduce, which we look at next.
Limitations of MapReduce
The MapReduce model made it possible to develop and run embarrassingly parallel computations on a large cluster of machines. However, every job had to read its input from disk and write its output back to disk, which put a lower bound on job latency determined by disk speeds. As a result, MapReduce was not a good fit for:
- Iterative computations, where a single job was executed multiple times or data were passed through multiple jobs.
- Interactive data analysis, where a user wants to run multiple ad hoc queries on the same dataset.
Note that Spark addresses both of these use cases, as the sketch below illustrates.
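To make this concrete, here is a minimal sketch of an iterative computation in Spark (Scala, local mode; the dataset and convergence rule are made up for illustration). The input is cached in memory once, so each pass after the first avoids the disk round trip that an equivalent chain of MapReduce jobs would pay:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // In a real job this would come from distributed storage, e.g. sc.textFile(...).
    // cache() keeps the partitions in memory after the first computation, so the
    // iterations below do not re-read or re-parse the input each time.
    val values = sc.parallelize((1 to 1000).map(_.toDouble)).cache()

    var threshold = 0.0
    for (_ <- 1 to 10) {
      // Each pass reuses the cached dataset; only the filter predicate changes.
      threshold = values.filter(_ > threshold).mean()
    }
    println(s"Threshold after 10 iterations: $threshold")

    sc.stop()
  }
}
```

The same caching mechanism is what makes interactive analysis cheap: once a dataset is materialized in memory, every subsequent ad hoc query over it skips the disk entirely.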
Foundation of Spark
Spark is based on the concept of Resilient Distributed Datasets (RDD).
Resilient Distributed Datasets (RDD)
An RDD is a distributed memory abstraction used to perform in-memory computations on large clusters of machines in a fault-tolerant way. More concretely, an RDD is a ...
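As a rough illustration of the API (a sketch in Scala, local mode, with made-up data), the snippet below builds an RDD from a collection, applies lazy transformations, and triggers the actual distributed computation with an action. Fault tolerance comes from lineage: if a partition is lost, Spark recomputes just that partition from the recorded chain of transformations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // Build an RDD from an in-memory collection, split across 4 partitions.
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)

    // Transformations (filter, map) are lazy: they only record lineage.
    val squaresOfEvens = numbers.filter(_ % 2 == 0).map(n => n.toLong * n)

    // Actions (reduce, count, collect) trigger the computation across partitions.
    println(s"Sum of squares of evens: ${squaresOfEvens.reduce(_ + _)}")

    sc.stop()
  }
}
```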