Automated Feature Engineering
Automating processes in data science using approximations.
Overview
Automated feature engineering is a powerful tool for reducing the manual work needed to build predictive models. Instead of a data scientist spending days or weeks coming up with the best features to describe a dataset, we can use tools that approximate this process. One library I’ve been working with to implement this step is FeatureTools. It takes inspiration from the automated feature engineering process in deep learning. However, it is meant for shallow learning problems, where you already have structured data spread across multiple tables and need to collapse it into a single record per user.
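To make the "single record per user" idea concrete, here is a minimal sketch of doing that collapse by hand with Pandas. The table and its columns (`user_id`, `amount`) are made up for illustration and are not part of any dataset used in this post. This is the kind of step a feature engineering library automates across many columns and aggregation types, not a picture of how FeatureTools works internally.

```python
import pandas as pd

# Hypothetical child table: several event rows per user.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Collapse the one-to-many table into one row per user by
# computing a few summary statistics for each user_id.
features = (
    events.groupby("user_id")
    .agg(
        event_count=("amount", "count"),
        amount_mean=("amount", "mean"),
        amount_max=("amount", "max"),
    )
    .reset_index()
)

print(features)
```

Each row of `features` can be fed to a model as one training example. Writing these aggregations by hand gets tedious quickly as the number of columns and tables grows.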
In our pre-configured execution environment, gcc and Python 3.7 are already installed along with the required libraries, so you can skip the local installation instructions below and go straight to the applications.
Installing libraries
The library can be installed as follows:
sudo yum install gcc
sudo yum install python3-devel
pip3 install framequery
pip3 install fsspec
pip3 install featuretools==1.8.0
In addition to this library, I installed the framequery library, which enables writing SQL queries against dataframes. Expressing dataframe operations in SQL, rather than in a library-specific interface such as Pandas, makes the logic easier to translate between different execution environments.
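As a rough illustration of why SQL ports well, the sketch below writes the same aggregation twice: once with the Pandas interface and once as a SQL query. I am using Python's built-in sqlite3 module as a stand-in here rather than framequery, so none of this reflects framequery's own API. The `plays` table and its columns are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical play events: several rows per game.
plays = pd.DataFrame({
    "game_id": [100, 100, 100, 200, 200],
    "event": ["Shot", "Goal", "Shot", "Shot", "Hit"],
})

# Pandas-specific version: count shots per game.
by_pandas = (
    plays[plays["event"] == "Shot"]
    .groupby("game_id")
    .size()
    .reset_index(name="shots")
)

# SQL version: load the dataframe into an in-memory SQLite
# database (a stand-in for a SQL-on-dataframe library) and query it.
conn = sqlite3.connect(":memory:")
plays.to_sql("plays", conn, index=False)
by_sql = pd.read_sql_query(
    """
    SELECT game_id, COUNT(*) AS shots
    FROM plays
    WHERE event = 'Shot'
    GROUP BY game_id
    ORDER BY game_id
    """,
    conn,
)
conn.close()

print(by_pandas)
print(by_sql)
```

Both versions produce the same counts, but the SQL string could be moved to a data warehouse or another SQL engine with little or no change, while the Pandas chain would need to be rewritten.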
Getting started
The task we’ll apply the FeatureTools library to is predicting which games in the Kaggle NHL dataset are postseason games. We’ll make this prediction based on summarizations of the play events that are recorded for each game. Since there can be hundreds of play events per game, we need a process for aggregating these ...