Introduction to Apache Spark
Let's take a look at Apache Spark and its architecture.
Apache Spark is a data processing system that was initially developed at the University of California, Berkeley by Matei Zaharia and his collaborators at the AMPLab, and was later donated to the Apache Software Foundation.
Note that Apache Spark was developed in response to some of the limitations of MapReduce.
Limitations of MapReduce
The MapReduce model allowed developers to build and run embarrassingly parallel computations on large clusters of machines, but every job had to read its input from disk and write its output back to disk. As a result, job latency had a lower bound determined by disk speeds. This made MapReduce a poor fit for:
- Iterative computations, where a single job is executed multiple times or data is passed through multiple jobs.
- Interactive data analysis, where a user wants to run multiple ad hoc queries on the same dataset.
Note that Spark addresses both of these use cases by keeping data in memory across computations, as the sketch below illustrates.
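As a minimal sketch of the interactive and iterative use cases, the snippet below (assuming a local Spark installation and a hypothetical logs.txt file) loads a dataset once, caches it in memory, and then runs several queries against it without re-reading the disk each time:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))

    // Load the dataset once and keep it in memory across queries.
    // "logs.txt" is a hypothetical input file used only for illustration.
    val logs = sc.textFile("logs.txt").cache()

    // Multiple ad hoc queries reuse the in-memory data instead of
    // reading the input from disk for every job, as MapReduce would.
    val errorCount = logs.filter(_.contains("ERROR")).count()
    val warningCount = logs.filter(_.contains("WARN")).count()

    println(s"errors=$errorCount, warnings=$warningCount")
    sc.stop()
  }
}
```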
Foundation of Spark
Spark is based on the concept of Resilient Distributed Datasets (RDD).
Resilient Distributed Datasets (RDD)
An RDD is a distributed memory abstraction that lets applications perform in-memory computations on large clusters of machines in a fault-tolerant way. More concretely, an RDD is a read-only, partitioned collection of records.
RDDs can be created through operations on data in stable storage or on other RDDs.
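The following sketch shows the two ways of creating an RDD mentioned above, plus creation from an in-memory collection. It assumes a local Spark setup and a hypothetical data.txt file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

    // An RDD created from data in stable storage (here, a text file on disk).
    val lines = sc.textFile("data.txt")

    // An RDD created from another RDD: each record is derived from a record of `lines`.
    val lineLengths = lines.map(_.length)

    // An RDD can also be created from an in-memory collection on the driver.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    println(lineLengths.sum() + numbers.sum())
    sc.stop()
  }
}
```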
Types of operations performed on RDDs
The operations performed on an RDD can be one of the following two types:
Transformations
Transformations are lazy operations that define a new RDD. Some examples of transformations are map, filter, join, and union.
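A small sketch of this laziness, assuming a local Spark setup: the map, filter, and union calls below only describe new RDDs, and no computation happens until an action is invoked.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyTransformationsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-transformations").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "rdd", "map", "filter", "join", "union"))

    // Transformations only define new RDDs; nothing is computed yet.
    val longWords = words.filter(_.length > 4)              // filter
    val upperWords = longWords.map(_.toUpperCase)            // map
    val combined = upperWords.union(words.map(_.toUpperCase)) // union

    // Only when an action (here, collect) runs does Spark execute the chain above.
    println(combined.collect().mkString(", "))
    sc.stop()
  }
}
```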
Actions
Actions trigger a computation to return a value to the program or write data to external storage. Some examples of actions are count, collect, reduce, and save.
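To illustrate, here is a sketch (assuming a local Spark setup and a hypothetical output directory) in which each action triggers a computation over the same RDD; save is shown via the saveAsTextFile action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("actions-sketch").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 10)

    // Each action below triggers a computation and returns a value to the program.
    val total = numbers.reduce(_ + _) // reduce: combine records into a single value
    val howMany = numbers.count()     // count: number of records in the RDD
    val asList = numbers.collect()    // collect: bring all records back to the driver

    println(s"total=$total, count=$howMany, first=${asList.head}")

    // Writing data to external storage ("numbers-output" is a hypothetical path).
    numbers.saveAsTextFile("numbers-output")
    sc.stop()
  }
}
```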