Distributed Feature Engineering

Encoding data and applying deep feature synthesis.

Overview

Feature engineering is a key step in a data science workflow, and it is sometimes necessary to use Python libraries to implement it. For example, the AutoModel system at Zynga uses the Featuretools library to generate hundreds of features from raw tracking events, which are then used as input to classification models. To scale up the automated feature engineering approach that we first explored in Automated Feature Engineering, we can use Pandas UDFs to distribute the work of applying the features. As in the prior section, we need to sample data when determining which transformations to perform, but when applying those transformations we can scale to massive datasets.
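The "decide on a sample, apply at scale" pattern can be sketched with plain Pandas. This is a minimal illustration rather than the lesson's own code: it substitutes simple one-hot encoding for deep feature synthesis, and the `game_id` and `event` columns, the data values, and the `apply_encoding` helper are all hypothetical. The partitions are simulated locally with a Pandas `groupby`. In Spark, a function of this shape could be handed to `groupBy(...).applyInPandas(...)` together with an output schema.

```python
import pandas as pd

# Hypothetical play-by-play sample; column names and values are illustrative.
sample = pd.DataFrame({
    "game_id": [1, 1, 2, 2],
    "event": ["Shot", "Goal", "Shot", "Hit"],
})

# Step 1: decide the transformation on a small sample. Here that means
# fixing the list of one-hot columns the encoding should produce.
encoded_columns = sorted("event_" + value for value in sample["event"].unique())


def apply_encoding(partition: pd.DataFrame) -> pd.DataFrame:
    """Apply the fixed encoding to one partition of the full dataset."""
    dummies = pd.get_dummies(partition["event"], prefix="event", dtype="int64")
    # Reindex so every partition yields the same columns, even when a
    # category is absent from that partition.
    dummies = dummies.reindex(columns=encoded_columns, fill_value=0)
    return pd.concat([partition[["game_id"]], dummies], axis=1)


# Step 2: apply the transformation partition by partition. Grouping by
# game_id stands in for the way a cluster would split up the data.
partitions = [group for _, group in sample.groupby("game_id")]
result = pd.concat([apply_encoding(p) for p in partitions], ignore_index=True)
print(result)
```

Fixing the column list up front matters because each partition only sees part of the data. Without the reindex step, partitions would return different column sets and could not be combined into one result.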

For this lesson, we’ll use the game plays dataset from the NHL Kaggle example, which includes detailed play-by-play descriptions of the events that occurred during each match. Our goal is to ...
