ETL Pipeline Exercise: Transform
Learn about the transform stage of the social media pipeline using Apache Airflow.
To continue our pipeline implementation, we’ll now focus on transforming the extracted data. According to the business requirements and the schema of the data warehouse, there are a few issues we need to fix with our extracted data. They are:
To change the month format of all date columns from numerical to text (for example, from 08 to Aug)
To remove tabs and new lines from the columns comment_text and post_text
To bin the number of followers into three categories, low, medium, and high (fewer than 1000 followers falls in the low category, between 1000 and 5000 in the medium category, and more than 5000 in the high category)
As before, we’ll write a function that accomplishes these tasks in a file called helper.py. Then, we’ll import the function to the DAG to add it to our pipeline.
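As a rough illustration of the second and third tasks, the sketch below defines two small helpers that work on a single value at a time. The names clean_text and bin_followers are hypothetical and not part of the exercise's helper.py. Two assumptions are made here: tabs and new lines are replaced with a single space (so adjacent words don't run together), and the boundary values 1000 and 5000 both fall in the medium category, since the requirements don't say which side they belong to.

```python
import re


def clean_text(text: str) -> str:
    """Replace runs of tabs and new lines with one space and trim the ends.

    Assumption: a space is substituted so that words separated only by a
    tab or line break do not get merged together.
    """
    return re.sub(r"[\t\r\n]+", " ", text).strip()


def bin_followers(count: int) -> str:
    """Map a follower count to 'low', 'medium', or 'high'.

    Assumption: 1000 and 5000 themselves are treated as 'medium'.
    """
    if count < 1000:
        return "low"
    if count <= 5000:
        return "medium"
    return "high"
```

If the extracted data is held in a pandas DataFrame, helpers like these could be applied to each value of the comment_text, post_text, and follower-count columns, for example with Series.apply.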
Note: Fill in the missing code snippets.
Change the month format
As mentioned, the first task in the transform stage is to change the month format of the post_date and comment_date columns from numerical to text. Use the strftime method to modify the format of datetime objects.
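To show how strftime handles the month, here is a minimal sketch using only the standard library. The function name month_to_text is hypothetical, and the input format %Y-%m-%d is an assumption, because the actual format of the extracted dates isn't stated here. The %b directive produces the abbreviated month name, which depends on the active locale (Aug for August under the default or an English locale).

```python
from datetime import datetime


def month_to_text(date_string: str,
                  input_format: str = "%Y-%m-%d",
                  output_format: str = "%Y-%b-%d") -> str:
    """Re-format a date string so the month appears as abbreviated text.

    Assumption: the incoming dates look like '2022-08-15'; adjust
    input_format if the extracted data uses a different layout.
    """
    parsed = datetime.strptime(date_string, input_format)  # string -> datetime
    return parsed.strftime(output_format)  # datetime -> string, %b = 'Aug'
```

For whole columns that already have a datetime type in pandas, the equivalent per-column call is Series.dt.strftime with the same format string.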