Automated Feature Engineering

Automating processes in data science using approximations.

Overview

Automated feature engineering reduces the manual work needed to build predictive models. Instead of a data scientist spending days or weeks crafting the best features to describe a dataset, we can use tools that approximate this process. One library I’ve been working with for this step is FeatureTools. It takes inspiration from the automated feature engineering that happens in deep learning, but it is meant for shallow learning problems where you already have structured data and need to translate multiple tables into a single record per user.
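To make "a single record per user" concrete, here is a minimal hand-written sketch of the kind of aggregation these tools automate. It does not use FeatureTools. The `events` table, its column names, the `PRIMITIVES` mapping, and the `summarize` helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: many event rows per user.
events = [
    {"user_id": 1, "amount": 10.0, "duration": 3.0},
    {"user_id": 1, "amount": 30.0, "duration": 5.0},
    {"user_id": 2, "amount": 5.0, "duration": 1.0},
]

# Aggregation functions applied to every numeric column (illustrative set).
PRIMITIVES = {"mean": mean, "max": max, "min": min}


def summarize(rows, key, columns):
    """Collapse many child rows into one feature record per key value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)

    features = {}
    for key_value, group in grouped.items():
        # Start with a simple row count, then add one feature
        # for each (primitive, column) pair.
        record = {"count": len(group)}
        for column in columns:
            values = [r[column] for r in group]
            for name, func in PRIMITIVES.items():
                record[f"{name}_{column}"] = func(values)
        features[key_value] = record
    return features


user_features = summarize(events, key="user_id", columns=["amount", "duration"])
```

With two columns and three aggregation functions, each user already ends up with seven features. The count grows quickly as tables and functions are added, which is why generating these combinations automatically is attractive.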

In our pre-configured execution environment, gcc and Python 3.7 are already installed along with the required libraries, so the local installation instructions below can be skipped.

Installing libraries

The library can be installed as follows:

sudo yum install gcc
sudo yum install python3-devel
pip3 install framequery
pip3 install fsspec
pip3 install featuretools==1.8.0

In addition to this library, I loaded the framequery library, which enables writing SQL queries against dataframes. Using SQL to work with dataframes, rather than a library-specific interface such as Pandas, is useful when translating between different execution environments.

Getting started

The task we’ll apply the FeatureTools library to is predicting which games in the Kaggle NHL dataset are postseason games. We’ll make this prediction based on summarizations of the play events that are recorded for each game. Since there can be hundreds of play events per game, we need a process for aggregating these ...
