ETL Transformation Example: Addressing Data Quality Issues

Learn how to use Bash scripting to transform and process the raw lottery data before loading.

Transform

Now that we have extracted the raw data, let’s transform it according to the needs and context of the business. At this point, we need to talk to the user who is requesting the data. For this demonstration, the user is the company’s data scientist.

The data scientist requests that the data be in tabular form without missing or null values. Also, there shouldn't be any duplicate dates in the file, and the data needs to have eight columns separated by commas. The columns are:

  1. “Date”
  2. “First_Lottery_number”
  3. “Second_Lottery_number”
  4. “Third_Lottery_number”
  5. “Fourth_Lottery_number”
  6. “Fifth_Lottery_number”
  7. “Sixth_Lottery_number”
  8. “Multiplier”

Furthermore, they would like us to sort the “Date” column in descending order.
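
To make these requirements concrete, here is a minimal sketch of how they could map onto a Bash pipeline. It is not the lesson's final script. The function name clean_lottery_csv is hypothetical, and the sketch assumes the rows are already comma-separated, that dates are written as YYYY-MM-DD (so sorting them as text also sorts them chronologically), and that a missing value shows up as an empty field or the literal text null.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads header-less CSV rows on stdin and writes
# a cleaned CSV (with header) to stdout.
# Assumptions: comma-separated input, ISO dates (YYYY-MM-DD) in column 1,
# missing values appear as empty fields or the text "null".
clean_lottery_csv() {
  # Header with the eight requested column names.
  printf '%s\n' "Date,First_Lottery_number,Second_Lottery_number,Third_Lottery_number,Fourth_Lottery_number,Fifth_Lottery_number,Sixth_Lottery_number,Multiplier"

  awk -F',' '
    # Skip rows that do not have exactly eight columns.
    NF != 8 { next }
    # Skip rows that contain a missing or null value.
    { for (i = 1; i <= NF; i++) if ($i == "" || $i == "null") next }
    # Print only the first row seen for each date.
    !seen[$1]++
  ' | LC_ALL=C sort -t',' -k1,1r   # newest date first
}
```

We could call it as clean_lottery_csv < raw_rows.csv > lottery_clean.csv, where both file names are placeholders. Dropping incomplete rows is only one way to handle missing values; the data scientist may prefer a different rule, so it is worth confirming with them first.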

Task 1: Make the data tabular

The first request is for the data to be tabular. Tabular data is data that can be organized in a table of rows and columns, where each cell holds exactly one value. A popular format for storing tabular data is CSV (comma-separated values).
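
One quick way to check whether a file is tabular in this sense is to confirm that every row has the same number of comma-separated fields. The sketch below does that with awk. The function name is_rectangular_csv is hypothetical, and the check is deliberately simple: it assumes no field contains a quoted comma.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads CSV on stdin and exits with status 0 if
# every row has the same number of fields as the first row, 1 otherwise.
# Assumption: fields never contain embedded (quoted) commas.
is_rectangular_csv() {
  awk -F',' '
    NR == 1    { cols = NF }          # remember the width of the first row
    NF != cols { bad = 1; exit }      # stop at the first ragged row
    END        { exit bad + 0 }       # 0 = tabular, 1 = not tabular
  '
}
```

For example, is_rectangular_csv < lottery.csv && echo "tabular" prints the message only when all rows line up (lottery.csv is a placeholder file name).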
