Spark and Big Data
Dig deeper into the Spark data processing model and its architecture.
Big data primer
Before we describe the processing model that Spark fits into, both in the context of this course and of big data more broadly, it's important to explain what big data means.
The term big data fundamentally refers to a range of technologies, each aligned with a different strategy for processing large datasets.
The word “large” has traditionally carried the implicit notion that the dataset being processed holds more information than a single resource, such as a lone server or computer, can realistically handle. Because available processing power and business needs are constantly changing, the word also implies that the exact size of such a dataset is not pinned to any specific figure.
As vague as it might seem, “big” is an appropriate word for datasets that are not defined by a size limit yet represent vast volumes of information. Big data solutions aim to solve the problems that conventional methods face when working with such datasets.
Another characteristic of big data scenarios is the variety of sources the information comes from, ranging from application logs and social network data to the output of physical devices. With different sources come different formats, so a big data solution must be prepared to work with many of them.
Whereas traditional systems might expect formatted or labeled input data, big data systems need to deal with raw data and eventually transform it into meaningful information according to different business needs, as the sketch below illustrates.
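
To make that idea concrete, here is a minimal PySpark sketch of such a raw-to-structured transformation. It assumes a local Spark installation; the file name app_logs.txt and the space-separated log format are hypothetical, chosen only for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

# A minimal sketch, assuming a local Spark runtime is available.
spark = SparkSession.builder.appName("raw-to-structured").getOrCreate()

# Read raw, unlabeled text: one string column named "value" per line,
# e.g. "2024-01-15 ERROR disk full". The path is hypothetical.
raw = spark.read.text("app_logs.txt")

# Transform the raw lines into meaningful, structured columns.
parts = raw.select(split(col("value"), " ", 3).alias("p"))
logs = parts.select(
    col("p")[0].alias("date"),
    col("p")[1].alias("level"),
    col("p")[2].alias("message"),
)

# The structured data can now serve a business need, e.g. counting
# error events per day.
logs.filter(col("level") == "ERROR").groupBy("date").count().show()

spark.stop()

Note that nothing about the input had to be formatted or labeled in advance: the structure is imposed at processing time, which is exactly the flexibility big data systems are expected to provide.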