...

/

Working with DataFrame's Schemas

Working with DataFrame's Schemas

Learn about DataFrame's structure or schema with practical examples from a new project.

Let’s imagine a scenario where the requirement is to ingest one CSV file with a specific format that doesn’t match our DataSource model. Changing an already existing data model can be too costly and impact other applications that feed off the database.

Fortunately, we already know how to modify the data when it resides on a Spark DataFrame and change its structure (such as adding one column in one of our previous lessons). The API also offers the possibility of removing columns and other exciting operations. Let’s see how we can achieve it.

Working on ingested data

The previous hypothetical requirement can be defined as the following


A client is sending data, to store in our DataSource (DB), in a CSV format that doesn’t fit into our normalized data model.

Our application should perform the necessary transformations to persist said information to the DB, with a matching structure.


The following widget contains the codebase for this lesson:

mvn install exec:exec
Project to ingest from CSV and work on DataFrame schemas

To ease our way into understanding the code snippet and application logic let’s illustrate the application’s flow with a simple diagram.

Note: If curiosity “sparks” within the reader, the following link can serve as a gentle introduction to what data normalization is. Although, it’s technically not necessary for this course.

The application ingests a CSV file (left on the diagram) using the same API methods we’ve used before. Right after this, it calls two transformations on the DataFrame.

But first, we need to learn new API invocations to inspect the schema or structure of the DataFrame. This is necessary because our database has been previously normalized to split different information into related tables, rather than having only one massive hard-to-maintain table.

Afterwards, we can apply transformations to drop columns and separate the information into two DataFrames that resemble the two tables in the database.

Inspecting a

...