Driver Program and Job Implementation: Part II

Let’s walk through the job implementation code, which executes as part of the driver program.

Concrete Job implementation and batch pipelines

Any class extending the abstract Job class must implement the abstract methods declared in it. This might seem like a tight constraint imposed by our choice of the template method design pattern to structure the job’s code, but in fact it helps establish a widely used batch-processing practice.

As a frame of reference, the following image displays an overview of the Spring Batch architecture:

The components in the diagram should look familiar. We can even draw a reasonably close comparison between this architecture and that of the application we are developing. The key point to grasp from the illustration is how similar it is to the way we implemented our Job classes:

A batch Job typically performs some pre-processing (reading from a source), followed by the actual processing (transforming or otherwise working on the records it read), and finally produces some output (writing to a DataSource).

This way of structuring the code, which also imposes a flow on the execution, is exactly what the abstract Job class’s contract requires, and it follows the template method pattern.
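To make that contract concrete, here is a minimal sketch of such a template method base class, assuming a generic payload type T; the names (Job<T>, readInput, process, writeOutput, execute) are illustrative and not necessarily the application’s actual API:

public abstract class Job<T> {

    // Hook methods that every concrete job must supply.
    protected abstract T readInput();               // pre-processing: read from a source
    protected abstract T process(T input);          // actual processing: transform the records
    protected abstract void writeOutput(T output);  // produce output: write to a sink

    // The template method: declared final so subclasses cannot alter the execution flow.
    public final void execute() {
        T input = readInput();
        T output = process(input);
        writeOutput(output);
    }
}

Declaring execute() as final is what enforces the read-process-write flow described above: subclasses fill in the steps, but the order is fixed by the base class.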

The SparkJob class extends the abstract Job class and types it to Dataset<Row> (a DataFrame, as we know it), which means the job deals with objects of this type both during processing and when interacting with other classes.

This also fits well with the design from the previous lessons, because Spark DataFrames are the central data structures of the application.

The following code extract displays this ...
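As a rough sketch, assuming the hypothetical Job<T> base class from the earlier sketch, such a SparkJob could look like the following; the SparkSession wiring, the Parquet paths, and the dropDuplicates transformation are illustrative placeholders rather than the application’s actual code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Binds the generic Job contract to Spark DataFrames (Dataset<Row>).
public class SparkJob extends Job<Dataset<Row>> {

    private final SparkSession spark;
    private final String inputPath;   // illustrative configuration
    private final String outputPath;

    public SparkJob(SparkSession spark, String inputPath, String outputPath) {
        this.spark = spark;
        this.inputPath = inputPath;
        this.outputPath = outputPath;
    }

    @Override
    protected Dataset<Row> readInput() {
        // Pre-processing: read records from a source.
        return spark.read().parquet(inputPath);
    }

    @Override
    protected Dataset<Row> process(Dataset<Row> input) {
        // Actual processing: a placeholder transformation.
        return input.dropDuplicates();
    }

    @Override
    protected void writeOutput(Dataset<Row> output) {
        // Produce output: write to a sink.
        output.write().mode("overwrite").parquet(outputPath);
    }
}

With something like this in place, the driver only needs to call new SparkJob(spark, in, out).execute(), and the template method guarantees the read-process-write order.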