Mastering Big Data with Apache Spark and Java

Ingestion Job: Part I

Let’s build a Spark batch job for the batch template application. The job ingests information, applies some processing (a transformation), and persists the result into an in-memory database.

We'll cover the following:

Spark in action
Business domain
Requirements
  Ingestion file example
Code implementation
  IngesterJob
  Reading input

Spark in action

Now that the application template has been fully set up in the previous lessons, we can develop a functional Job example.

The Job uses the Spark Java API inside the Spark component classes, as taught in the previous sections.

Business domain

The business domain of the application is market operations, such as sales and purchases, for a fictitious company called Market Analytics Solution Inc. This company provides retailers with a solution for processing large volumes of information about sales and acquisitions of many kinds of goods.

The Job for this lesson, the IngesterJob, offers functionality to ingest raw data as presented in files of different formats. It then transforms this data into meaningful information and persists it to a database for further processing, potentially by other Jobs downstream in a batch workflow.

Requirements

The business analyst in this company has been kind enough to provide us with the following refined requirements for the Job at hand:

The IngesterJob needs to process sales records in JSON format and persist them into a database table. Each record carries the following fields:

Seller_Id,Date,Product,Quantity

Note: An ingestion format example is attached as well (see next subsection).

The DBA team has modeled a database with a SALES table, which has the following structure:

ID|SELLER_ID|DATE|PRODUCT|QUANTITY
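
To make the mapping between the input fields and the SALES columns concrete, here is a minimal sketch of a plain Java value class. The class name `Sale`, its accessors, and the `toRow` helper are hypothetical and are not taken from the course codebase. The date is kept as text because the requirements do not state a column type, and the ID column is left out on the assumption that the database generates it.

```java
import java.util.Objects;

// Hypothetical value class mirroring the input fields Seller_Id, Date, Product, Quantity.
// The SALES.ID column is intentionally absent: it is assumed to be database-generated.
final class Sale {
    private final String sellerId;
    private final String date;      // kept as text; the column type is not given in the requirements
    private final String product;
    private final int quantity;

    Sale(String sellerId, String date, String product, int quantity) {
        this.sellerId = Objects.requireNonNull(sellerId, "sellerId");
        this.date = Objects.requireNonNull(date, "date");
        this.product = Objects.requireNonNull(product, "product");
        this.quantity = quantity;
    }

    String getSellerId() { return sellerId; }
    String getDate() { return date; }
    String getProduct() { return product; }
    int getQuantity() { return quantity; }

    // Values in the column order SELLER_ID, DATE, PRODUCT, QUANTITY.
    Object[] toRow() {
        return new Object[] {sellerId, date, product, quantity};
    }
}
```

For example, `new Sale("Joe", "12/11/2020", "Beer", 14).toRow()` yields the four values that would fill one SALES row, apart from the generated ID.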

Ingestion file example

The following is an extract from an ingestion file in JSON format:

And the following snippet shows the same sales data as it is persisted in the database table:

Seller_Id,Date,Product,Quantity
//Other Records...
"Joe","12/11/2020","Beer",14
"Joe","13/11/2020","Beer",14
"Joe","13/11/2020","Food",12
"Beth","11/11/2020","Beer",16
"Beth","12/11/2020","Food",17
//Other Records...

The table’s ID field is omitted from each row for brevity and because it is automatically generated by the database engine.
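
The sample rows above suggest day-first dates (a value such as 13/11/2020 cannot be month-first) and an integer quantity. The following is a small sketch, using only the Java standard library, of how one such row could be split into typed values. The class `SampleRowParser`, its method names, and the `dd/MM/yyyy` pattern are assumptions inferred from the sample, not part of the course codebase, and the naive comma split assumes that values never contain commas or escaped quotes.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for rows written like: "Joe","12/11/2020","Beer",14
final class SampleRowParser {
    // Day-first pattern inferred from sample values such as "13/11/2020" (assumption).
    static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // Returns {sellerId (String), date (LocalDate), product (String), quantity (Integer)}.
    static Object[] parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields but got " + parts.length + ": " + line);
        }
        String sellerId = unquote(parts[0]);
        LocalDate date = LocalDate.parse(unquote(parts[1]), DATE_FORMAT);
        String product = unquote(parts[2]);
        int quantity = Integer.parseInt(parts[3].trim());
        return new Object[] {sellerId, date, product, quantity};
    }

    // Strips surrounding whitespace and one pair of enclosing double quotes, if present.
    private static String unquote(String value) {
        String v = value.trim();
        if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
            return v.substring(1, v.length() - 1);
        }
        return v;
    }
}
```

Parsing the date into a `LocalDate` early means a malformed value fails at read time with a `DateTimeParseException` instead of reaching the database as arbitrary text.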

Code implementation

Let’s describe and interpret the most relevant parts of the code that allow the application to fulfill the requirements. The following project contains the codebase for this lesson:
