Big Data and Apache Spark

Learn more about big data and big data processing.

In this chapter, we will look at a popular big data processing framework called Apache Spark. In the next chapter, we will discuss a distributed database system.

What is big data?

Answering this question is a bit tricky because the definition depends heavily on the context. But let's start with a working definition.

Big data is a large amount of data that cannot be stored or processed using traditional methods.

In traditional data processing, data is handled on a single machine. Some amount of data is stored on the disk, and a program reads it, extracts what is required, and processes it. If the data is small and can be processed easily on an average machine, we do not need any fancy specialized algorithm (like MapReduce) to process it reasonably fast. Things get more complicated when the data's size becomes unmanageable on an average machine.

The concept of big data gained popularity with the rise of the internet. A few decades ago, engineers were not particularly concerned about processing large amounts of data, for an obvious reason: the internet and internet services were nowhere near as ubiquitous as they are today. With the rise of internet services, data became the key to creating better services for customers. But the amount of data produced is simply incomparable to the capacity of an average machine. As a result, the concepts and solutions behind big data processing rose in popularity.
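To ground the traditional, single-machine approach described above, here is a minimal sketch in Python. The file name "events.log" and the "ERROR" marker are hypothetical; the point is that a single process reads the whole file from disk and aggregates it.

```python
# A minimal sketch of traditional single-machine processing:
# read a file from local disk, extract what is needed, aggregate.
# The file name "events.log" and the "ERROR" marker are hypothetical.

def count_errors(path: str) -> int:
    errors = 0
    with open(path) as f:
        for line in f:            # one process scans the entire file
            if "ERROR" in line:   # extract only what we care about
                errors += 1
    return errors

if __name__ == "__main__":
    print(count_errors("events.log"))
```

This works while the file fits on one disk and one machine can scan it in acceptable time. Big data is, by definition, what breaks that assumption.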

Processing big data is a challenging problem, and many companies have to solve it sooner or later as their data grows.

Classification

We can classify big data into three categories.

  • Structured: This type of data has a well-defined structure. Think of a SQL table: all the columns and their data types are predefined. With structured big data, we know in advance what data will be present and what its fields and types are.

  • Semi-structured: Semi-structured data does not follow a strict schema, but it does have some structure. For example, consider log data from an application where every line has a timestamp followed by a string. The string can contain a variety of data, such as error objects, application-related metrics, etc. (see the sketch after this list).

  • Unstructured: As the name suggests, unstructured data has no predefined structure. Image and video files are good examples. Unstructured data usually has associated metadata describing its properties, and that metadata tends to be structured or semi-structured. For example, an image has properties like resolution, size, and date of creation, all captured separately in well-defined metadata.
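To make the semi-structured case concrete, here is a minimal sketch. The log format is hypothetical: every line starts with a timestamp (the structured part), and the rest of the line may or may not parse as JSON.

```python
import json

# Hypothetical semi-structured log lines: a timestamp prefix,
# followed by a payload that may or may not be JSON.
lines = [
    '2024-01-15T10:00:00 {"level": "ERROR", "code": 500}',
    "2024-01-15T10:00:01 user signed in",
]

for line in lines:
    timestamp, _, payload = line.partition(" ")  # the structured part
    try:
        record = json.loads(payload)             # sometimes structured...
    except json.JSONDecodeError:
        record = {"message": payload}            # ...sometimes free text
    print(timestamp, record)
```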

Apache Spark

We know that processing big data is challenging. Over time, however, researchers and engineers have developed well-defined mechanisms for it. As more and more companies began working on big data, and much of that work overlapped, shared frameworks were developed to make things easier for everyone. One such framework is Apache Spark.

In reality, Spark is no longer just a framework; it has become an ecosystem. It is an analytics engine with broad support for different big data use cases, such as batch and stream processing, machine learning, and more.

So what does Apache Spark provide that helps with big data processing?

Like many frameworks, Spark provides excellent APIs and data structures in popular languages like Java, Scala, and Python. Armed with this support, engineers can build complex data-processing pipelines in a much simpler way.
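As an illustration, here is a minimal PySpark sketch. It assumes the pyspark package is installed and that a file named events.json (hypothetical) contains records with level and code fields.

```python
from pyspark.sql import SparkSession

# A minimal PySpark sketch. Assumes pyspark is installed and that
# "events.json" (hypothetical) holds records with "level" and "code".
spark = SparkSession.builder.appName("spark-api-example").getOrCreate()

df = spark.read.json("events.json")       # load semi-structured data
errors = df.filter(df.level == "ERROR")   # declarative transformation
errors.groupBy("code").count().show()     # aggregate and display

spark.stop()
```

The same program runs unchanged on a laptop or on a cluster; only the deployment configuration differs.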

Spark lets us write code as if it were executing sequentially on a single machine. Behind the scenes, Spark takes care of everything needed to process the input data in parallel, so the processing is fast. This makes writing big data solutions much simpler for developers.
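Here is a minimal sketch of that idea, again assuming pyspark is installed. The code reads top to bottom like a sequential program, but Spark splits the data into partitions and runs the map and filter steps in parallel across them; nothing executes until the final action (count) is called.

```python
from pyspark.sql import SparkSession

# Sequential-looking code that Spark executes in parallel.
spark = SparkSession.builder.appName("parallel-example").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000), numSlices=8)  # 8 partitions
result = (numbers
          .map(lambda x: x * x)          # applied per partition, in parallel
          .filter(lambda x: x % 3 == 0)  # also parallel; still lazy
          .count())                      # the action triggers execution
print(result)

spark.stop()
```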

Key takeaways

  • Big data refers to a large volume of data that requires special treatment for processing due to its size.

  • Different frameworks have been developed to facilitate big data processing. One such example is Apache Spark, which makes big data processing much simpler for developers.
