ETL Pipeline Exercise: Transform
Explore how to transform extracted data within an ETL pipeline by reformatting date fields from numeric to text, cleaning text columns of tabs and new lines, and categorizing follower counts into defined levels. Understand how to implement these transformations using Python functions integrated with Apache Airflow to prepare data for loading into warehouses. This lesson helps you refine and structure data to meet business and schema requirements, and export the results for subsequent use.
To continue our pipeline implementation, we’ll now focus on transforming the extracted data. According to the business requirements and the schema of the data warehouse, there are a few issues we need to fix with our extracted data. They are:
To change the month format of all date columns from numerical to text (for example, from 08 to Aug)
To remove tabs and new lines from columns comment_text and post_text
To bin the number of followers into three categories, low, medium, high (the number of followers lower than 1000 will be in low category, the number of followers ...
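The three fixes listed above can be sketched as small, independent Python functions. This is only an illustrative sketch, not the lesson's own implementation: the function names, the assumed YYYY-MM-DD date layout, and the high_limit parameter are all hypothetical. In particular, the boundary between medium and high is not given in the text above, so the caller must supply it.

```python
import re

# Fixed abbreviations so the result does not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def month_to_text(date_str: str, sep: str = "-") -> str:
    """Swap the numeric month for its abbreviation.

    Assumes dates look like 'YYYY-MM-DD' (an assumption, not from the lesson).
    """
    year, month, day = date_str.split(sep)
    return sep.join([year, MONTHS[int(month) - 1], day])


def strip_tabs_and_newlines(text: str) -> str:
    """Replace each run of tabs, carriage returns, and newlines with one space."""
    return re.sub(r"[\t\r\n]+", " ", text).strip()


def bin_followers(count: int, low_limit: int = 1000, *, high_limit: int) -> str:
    """Bin a follower count into 'low', 'medium', or 'high'.

    Counts lower than low_limit are 'low', as stated in the requirements.
    high_limit is a hypothetical, caller-supplied boundary; treating counts
    at or above it as 'high' is also an assumption.
    """
    if count < low_limit:
        return "low"
    if count < high_limit:
        return "medium"
    return "high"


if __name__ == "__main__":
    print(month_to_text("2021-08-15"))
    print(strip_tabs_and_newlines("first line\n\tsecond line"))
    # 10_000 is a placeholder boundary chosen only for this demonstration.
    print(bin_followers(250, high_limit=10_000))
```

Keeping each transformation in its own function makes it easy to test in isolation and to call from a single Airflow task later in the pipeline.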