Components and Architecture

Core components

Behind the scenes, Spark consists of a core component on top of which different libraries sit. This is no accident: the creators of Spark chose this architecture so they could keep adding modules for different functionalities.

This type of architecture resembles a “Plugin Architecture,” in which features can be developed and incorporated over time.

Let’s take a brief look at each of them.

Spark Core

The nucleus of Spark contains the basic but fundamental functionality: scheduling application execution, memory management, interaction with storage systems, fault recovery, and so on.

Spark Core is the home of the Resilient Distributed Dataset (RDD), an immutable, fault-tolerant, in-memory collection of elements representing partitioned data. Besides raw data, an RDD can also hold more complex types, such as Scala, Python, or Java programmatic constructs (classes or data structures, for example).

RDDs have been part of Spark since version 1.0 and support both object-based and functional approaches to manipulating the data they represent.
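To make the functional style concrete, here is a minimal sketch in plain Python. This is not the Spark API: the tuple simply stands in for an RDD's immutable contents, and the chained map and filter steps mirror how RDD transformations produce new collections rather than mutating the original.

```python
# Conceptual sketch only: plain Python standing in for RDD-style transformations.
# In real Spark you would build an RDD and chain .map(...) / .filter(...) on it;
# the functional pattern shown here is the same.

numbers = (1, 2, 3, 4, 5)  # immutable, like an RDD's contents

# Apply functions to every element, producing new collections each time.
squared = tuple(map(lambda x: x * x, numbers))
evens = tuple(filter(lambda x: x % 2 == 0, squared))

print(squared)  # (1, 4, 9, 16, 25)
print(evens)    # (4, 16)
```

Note that `numbers` is never modified: each transformation yields a new value, which is exactly the immutability guarantee RDDs provide.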

This translates into richer, more expressive data manipulation, with functions applied directly to the elements of a collection. A typical example is the aptly named pair RDD: a collection in which each record is identified by a key or ID.
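As an illustration, here is a small, hypothetical plain-Python model of what a pair RDD represents and what a keyed aggregation (akin to Spark's reduceByKey) does. The function name, keys, and values are invented for the example; real Spark performs this grouping across partitions on a cluster.

```python
from collections import defaultdict
from functools import reduce

# A pair "RDD" modeled as a list of (key, value) records.
pairs = [("apple", 2), ("banana", 1), ("apple", 3), ("banana", 4)]

def reduce_by_key(records, func):
    """Group values by key, then fold each group with func --
    conceptually what a keyed reduction does in Spark."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

totals = reduce_by_key(pairs, lambda a, b: a + b)
print(totals)  # {'apple': 5, 'banana': 5}
```

The key-per-record structure is what makes operations like grouping, joining, and aggregating by ID natural to express.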

The following image depicts this technique, used extensively in the early days of Spark and still common today in the Spark Scala API, which has a stronger focus on the functional programming paradigm:
