Spark and Big Data

Dig deeper into the Spark data processing model and its architecture.

Big data primer

Before we describe the processing model that Spark fits into, both in the context of this course and of big data in general, it’s important to explain what big data means.

The term big data fundamentally refers to the various technologies, and the strategies they follow, for processing very large datasets.

The word “large” has traditionally implied that the dataset being processed contains more information than a single resource, such as a lone server or computer, can realistically handle. Because available processing power and business needs change constantly, the word also implies that the size of such a dataset is not pinned to a specific figure.

As vague as it might seem, “big” is an appropriate word for datasets that have no defined size limit yet represent vast volumes of information. Big data solutions aim to solve the problems that conventional methods face when working with such datasets.

Another characteristic of big data scenarios is the variety of sources the information comes from. These sources range from application logs and social network data to the output of physical devices. With different sources come different formats, so a big data solution must be prepared to work with many of them.

Whereas traditional systems might expect formatted or labeled input data, big data systems need to deal with raw data and eventually transform it into meaningful information according to different business requirements.
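
To make this concrete, here is a minimal sketch, assuming a PySpark environment and a hypothetical plain-text log file and format (the path, the line layout, and the column names are all illustrative, not part of this course’s dataset), of turning raw, unlabeled lines into structured records that later steps can work with:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-structured").getOrCreate()

# Ingest raw, unstructured text: each line becomes a single "value" column.
raw = spark.read.text("logs/app.log")  # hypothetical path

# Impose structure on raw lines such as
# "2023-01-15 ERROR payment-service request timed out"
# using simple pattern extraction.
structured = raw.select(
    F.regexp_extract("value", r"^(\S+)", 1).alias("date"),
    F.regexp_extract("value", r"^\S+\s+(\S+)", 1).alias("level"),
    F.regexp_extract("value", r"^\S+\s+\S+\s+(\S+)", 1).alias("service"),
)

structured.show(truncate=False)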

Alongside the evolution of big data systems, patterns started to emerge. One of these is broadly known as the big data life cycle.

Big data life cycle

Even though big data systems do not process data uniformly, there are commonalities in the processing strategies and the steps they usually involve.

The following is a non-exhaustive list of common steps, usually referred to collectively as the big data life cycle (a sketch of how these steps can look in Spark follows the list):

  1. Ingestion: This is the process of incorporating raw data into the system for further processing. The complexity of this operation varies depending on the formats, sources, and media used to ingest the information into a system.

  2. Transformation or Analysis: This comprises the different operations applied to the raw data, which might transform, analyze, sort, aggregate, or filter the bulk of the ingested data. Labeling and validation of the data might also take place in this step. Traditionally, this step is referred to as the “computing step”, and its primary goal is to produce information.

  3. Storage: Whether in the same source or a different one, the information produced in the previous step is stored or persisted in a durable medium.

  4. Visualization: In the step that perhaps provides the most value to stakeholders, the stored information is displayed or visualized with the aid of tools. This provides meaningful data insights and helps detect trends or patterns in how data changes over time.
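
As promised above, here is a minimal sketch of how these life cycle steps can map onto Spark operations, assuming PySpark. The file paths, the orders dataset, and column names such as "region" and "amount" are hypothetical; the point is only to show ingestion, transformation, and storage expressed in Spark, with visualization typically handled by external tools that read the stored output.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lifecycle-sketch").getOrCreate()

# 1. Ingestion: incorporate raw data into the system.
orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# 2. Transformation or analysis: filter and aggregate to produce information.
revenue_by_region = (
    orders.filter(F.col("amount") > 0)
          .groupBy("region")
          .agg(F.sum("amount").alias("total_revenue"))
)

# 3. Storage: persist the produced information in a durable format.
revenue_by_region.write.mode("overwrite").parquet("output/revenue_by_region")

# 4. Visualization: usually done by external tools (dashboards, notebooks)
#    that read the stored output rather than by Spark itself.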

Spark and the batch data processing model

Even though Spark can process data in several ways, with each Spark component targeting a specific data processing model, this course focuses on the batch processing model.

To get a frame of reference, let’s quickly explain the terms relevant to batch processing.

The batch processing model that this course builds with Spark fits within the broader big data life cycle described previously. It includes stages such as ingesting, filtering, and applying transformations, and it ultimately produces valuable information.

Batch processing is a method of computing information over large datasets. It involves splitting the data into smaller pieces or chunks, which are scheduled and sent to different processing units (or individual machines) for the actual computation to take place.

This computation, however, might involve shuffling information around, collecting intermediate results, and ultimately putting the pieces back together to form a final result.
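
As a rough illustration of these ideas, the sketch below (again assuming PySpark, with a tiny in-memory dataset standing in for a large one) shows how a dataset is split into partitions that executors can process in parallel, and how a grouped aggregation forces a shuffle that combines intermediate results into a final answer:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-partitions").getOrCreate()

# Hypothetical click data; a real batch job would read a large dataset instead.
data = [("web", 10), ("mobile", 5), ("web", 7), ("mobile", 3), ("web", 1)]
events = spark.createDataFrame(data, ["channel", "clicks"])

# Split the data into chunks (partitions) that executors process in parallel.
events = events.repartition(4)
print("partitions:", events.rdd.getNumPartitions())

# The grouped aggregation shuffles rows with the same key to the same partition,
# combines the per-partition intermediate results, and produces the final result.
totals = events.groupBy("channel").agg(F.sum("clicks").alias("total_clicks"))
totals.show()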

The following diagram depicts a series of generic steps that broadly define a batch processing scheme, with Spark as the batch processing tool:
