Enriching the Basic DataFrame Program
Let’s add a database to our program and use the DataFrame API to produce more meaningful information.
Adding design into the mix
Documentation tends to serve as a reference, manual, and reminder of an application’s design that we consult repeatedly. It can also serve as a fundamental blueprint for a project under construction, illustrating the landscape and the interactions between its parts or components.
As we enrich our application in this lesson by adding a database into the mix and transforming the data through the DataFrame API, it is helpful to illustrate the big picture of these changes and, more importantly, how Spark fits into it.
Throughout this course, we’ll maintain this practice, especially as our main batch application grows in complexity.
The application’s model
Because our application is not complex, the following diagram mixes both architectural and flow perspectives into one:
Our application performs the following actions, which map to components in different layers:
- The application layer, consisting of non-Spark Java code, does some basic setup and organises the workload passed on to Spark. It also performs tidy-up tasks after receiving the processing results.
- The Spark layer does the core of the processing. First, it reads a CSV file with the input data into a DataFrame. Then, it transforms that data by instructing the cluster to perform the work in parallel; this is where the Spark driver excels at driving the whole process. Along with these tasks, the driver also instructs the worker nodes to save the modified data in the database (see the code sketch after this list).
- Lastly, the Spark layer issues a command to generate a CSV file as output and returns workflow control to the application layer, which can close resources, perform cleanup tasks, and more before exiting.
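To make this flow more concrete, here is a minimal sketch of the pipeline in Java using the Spark DataFrame API. It is illustrative only: the input and output paths, column names, JDBC connection URL, table name, and credentials are hypothetical placeholders, and a matching JDBC driver would need to be on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import java.util.Properties;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;

public class EnrichedCsvToDbApp {

  public static void main(String[] args) {
    // Application layer: basic setup before handing the workload to Spark.
    SparkSession spark = SparkSession.builder()
        .appName("Enriched CSV to DB")
        .master("local[*]")              // assumption: running against a local master
        .getOrCreate();

    // Spark layer: read the input CSV into a DataFrame.
    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")
        .load("data/input.csv");         // hypothetical input path

    // Transform the data in parallel on the cluster; the columns here are placeholders.
    Dataset<Row> transformed = df.withColumn(
        "full_name",
        concat(col("first_name"), lit(" "), col("last_name")));

    // Have the workers persist the modified data into a relational database via JDBC.
    Properties dbProps = new Properties();
    dbProps.setProperty("user", "spark_user");          // hypothetical credentials
    dbProps.setProperty("password", "spark_pwd");
    transformed.write()
        .mode(SaveMode.Overwrite)
        .jdbc("jdbc:postgresql://localhost:5432/appdb", // hypothetical connection URL
              "enriched_records",                       // hypothetical target table
              dbProps);

    // Generate the CSV output, then hand control back to the application layer.
    transformed.coalesce(1)              // single output file for readability
        .write()
        .mode(SaveMode.Overwrite)
        .option("header", "true")
        .csv("data/output");             // hypothetical output directory

    // Application layer: tidy-up tasks before exiting.
    spark.stop();
  }
}
```

Running this class as a normal Java application, for example with spark-submit or directly from the IDE against a local master, exercises all three stages of the model: setup in the application layer, reading, transforming, and persisting in the Spark layer, and cleanup back in the application layer.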
With this in mind, we can proceed to see ...