ETL Pipeline Exercise: Transform

To continue our pipeline implementation, we’ll now focus on transforming the extracted data. The business requirements and the schema of the data warehouse call for three changes to the extracted data:

  1. Change the month format of all date columns from numerical to text (for example, from 08 to Aug).

  2. Remove tabs and newlines from the comment_text and post_text columns.

  3. Bin the number of followers into three categories: low (fewer than 1000 followers), medium (1000 to 5000 followers, inclusive), and high (more than 5000 followers).
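The second and third requirements can be sketched with two small helpers that operate on a single value. This is only an illustration, not the exercise's solution: the function names clean_text and bin_followers are hypothetical, and replacing each tab or newline with a space (rather than deleting it outright) is an assumption made so that adjacent words don't run together.

```python
import re


def clean_text(text: str) -> str:
    """Replace runs of tabs and newlines with one space, then trim the ends.

    Assumption: whitespace is replaced by a space instead of being dropped,
    so "hello\nworld" becomes "hello world" rather than "helloworld".
    """
    return re.sub(r"[\t\r\n]+", " ", text).strip()


def bin_followers(count: int) -> str:
    """Map a follower count to the low / medium / high category."""
    if count < 1000:
        return "low"
    if count <= 5000:
        return "medium"
    return "high"


print(clean_text("Great post!\n\tThanks for sharing."))  # Great post! Thanks for sharing.
print(bin_followers(999), bin_followers(1000), bin_followers(5001))  # low medium high
```

If the extracted data lives in a pandas DataFrame, helpers like these could be applied to a whole column with Series.apply.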

As before, we’ll write a function that accomplishes these tasks in a file called helper.py. Then, we’ll import the function into the DAG to add it to our pipeline.

Note: Fill in the missing code snippets.

Change the month format

As mentioned, the first task in the transform stage is to change the month format of the post_date and comment_date columns from numerical to text. Use the strftime method to change the format of datetime objects; its %b directive produces the abbreviated month name (for example, Aug).
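As a minimal sketch of the idea, the snippet below parses a date string and writes it back with a textual month. The helper name month_to_text and the input format %Y-%m-%d are assumptions for illustration; the actual format of post_date and comment_date may differ.

```python
from datetime import datetime


def month_to_text(date_string: str,
                  fmt_in: str = "%Y-%m-%d",
                  fmt_out: str = "%Y-%b-%d") -> str:
    """Re-render a date string so that the month appears as text.

    fmt_in is an assumed input layout; %b in fmt_out gives the
    abbreviated month name in the current locale.
    """
    parsed = datetime.strptime(date_string, fmt_in)
    return parsed.strftime(fmt_out)


# Under Python's default (C/English) locale:
print(month_to_text("2022-08-15"))  # 2022-Aug-15
```

For a pandas column that already holds datetime values, the same format string can be passed to the column's .dt.strftime accessor.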
