Spark Fundamentals
An overview of Spark: the problems it is designed to solve and how it is structured.
Why choose Spark?
As the demand to process data and generate information continues to grow, engineers and data scientists are increasingly searching for easy and flexible tools to carry out parallel data analysis. This need has become even more apparent with the rise of cloud computing, which makes processing power and horizontal scaling more readily available.
Spark comes into this picture as one such tool due to the following principal reasons:
Ease of use: Spark is straightforward to use in comparison to tools that pre-date it, such as Hadoop with its MapReduce engine. It enables developers to focus on the logic of the computation while they code against high-level APIs. It can also be installed and used on a simple laptop.
Speed: Spark is fast, largely because it keeps data in memory as much as possible during computation, and it is widely praised for this in the big data world.
General-purpose engine: Spark allows developers to use and combine multiple types of computation, such as SQL queries, text processing, and machine learning.
What is Spark?
Spark is fundamentally a cluster-based computational platform designed to be fast and general purpose. If we attempt to define a specific purpose for Spark we’d find ourselves constrained by the many use cases this technology offers. However, Spark is usually referred to as a unified analytics engine for large-scale data processing.
In developers’ terms, the beauty of Spark is that it is a set of libraries with APIs available in several languages, such as Scala, Java, and Python. Spark enriches a program with intensive processing capabilities and gives it a distributed nature, which allows it to run on every machine within a cluster in a coordinated fashion.
Because Spark runs computations in memory as much as possible, it can reduce the time taken to process large datasets from hours to minutes. It does so by primarily processing chunks of data in memory rather than relying on I/O devices such as hard disks, which introduce higher latencies in data processing and transfer.
Its general-purpose nature is expressed by being able to support a wide range of different features, such as imperative programming (iterative algorithms), SQL querying, stream processing, and, in this course’s case, batch processing.
Spark is open source and backed by the Apache Software Foundation, so it benefits from developers’ contributions that aim to make it more efficient or add new features (which naturally undergo a review, check, and approval process by Apache).
Tech giants such as Netflix and eBay have deployed Spark at massive scale, and its adoption spans many diverse industries.
A brief history of Spark
So what are the precursors to Spark? Let’s take a look at Spark’s history and the history of related technologies.
Enter MapReduce
MapReduce was born in 2004 as a distributed processing framework that was originally proprietary to Google. It can broadly be described as a model for processing big datasets in parallel on a cluster, mainly comprising three operations:
- A Map procedure can be seen as a function applied to local data on different nodes of a cluster; this defines the “mapping” stage. It identifies or defines keys (IDs) and their corresponding values for records.
- A Shuffle stage, usually accompanied by a Sort, moves data between the nodes of a cluster. It operates on the results of the Map stage and arranges related information at a global scale: it groups the key-value pairs by matching keys and sorts them as it goes.
- A Reduce operation is then conducted on the output of the previous stage. It executes on the nodes of the cluster again and applies some computation to aggregate the mapped information.
This model implements a “Divide and Conquer” approach, where a big problem (in this case, applying an algorithm efficiently to large quantities of data) is split into smaller parts that are executed in parallel. As a result, it uses resources more efficiently and takes less time.
Let’s illustrate this process with a simple example:
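The three stages can be sketched in plain Python. This is a minimal, single-machine sketch that assumes a word count task as the example. The function names (`map_stage`, `shuffle_stage`, `reduce_stage`) and the sample text are hypothetical, and the code does not use Hadoop or Spark; each string in `chunks` merely stands in for the data held locally by one node.

```python
from collections import defaultdict


def map_stage(chunk):
    """Map: emit a (word, 1) key-value pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_stage(mapped_chunks):
    """Shuffle and sort: group values by matching key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    # Sort by key so related information is arranged in a global order.
    return dict(sorted(grouped.items()))


def reduce_stage(grouped):
    """Reduce: aggregate the values of each key, here by summing them."""
    return {key: sum(values) for key, values in grouped.items()}


# Assumption: each string represents the local data of one cluster node.
chunks = ["spark is fast", "spark is general", "mapreduce is older"]

# On a real cluster the map calls would run in parallel, one per node;
# here they run one after another for simplicity.
mapped = [map_stage(chunk) for chunk in chunks]
grouped = shuffle_stage(mapped)
word_counts = reduce_stage(grouped)

print(word_counts)
# {'fast': 1, 'general': 1, 'is': 3, 'mapreduce': 1, 'older': 1, 'spark': 2}
```

Splitting the input into chunks and mapping each one independently is the “divide” step; the shuffle and reduce stages then combine the partial results into the final answer.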