Exploratory Data Analysis—Profiling

The first steps of machine learning

Once a machine learning project starts, the first step is to split the data into training and test sets. Splitting the dataset will be discussed later in the course.

The second step is to perform exploratory data analysis (EDA) only on the training data.

A frequent error among new machine learning practitioners is to perform EDA on all the data before making the training/test split. Performing EDA on all the data can lead the model to make low-quality predictions on new, unseen data, which is to say that the model overfits.

This type of overfitting is caused by information leakage. Information leakage happens when test set information influences the training of a model. Strict adherence to machine learning best practices can avoid information leakage. Information leakage will be covered later in the course.
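The ordering described above (split first, then look only at the training rows) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the course's own code. The helper name `train_test_split_rows`, the 80/20 ratio, the seed, and the toy rows are all assumptions made for the example.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up rows for illustration only; these are not real data.
rows = [{"id": i, "fare": float(10 + i)} for i in range(10)]

# Step 1: split before looking at anything.
train, test = train_test_split_rows(rows)

# Step 2: every statistic used to understand the data comes from train only.
train_mean_fare = sum(r["fare"] for r in train) / len(train)

print(len(train), len(test))
```

The test rows are set aside and never summarized or plotted, so nothing learned during EDA can carry test set information into the model.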

EDA for machine learning

The purpose of EDA with the training data is threefold:

  • To become familiar with the data’s profile—the data types, the range of values of numeric features, level counts for categorical features, whether there’s missing data, etc.

  • To visualize the individual features in terms of their potential predictive power—for example, in the case of the Titanic dataset, does the Pclass feature help predict survival?

  • To assess whether there are opportunities to create new features from the data that will help the model’s predictions. This is known as feature engineering and will be covered later in the course.
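The second and third purposes can be made concrete with a small Python sketch. It assumes training rows are held as dictionaries keyed by Titanic column names (`Survived`, `Pclass`, `SibSp`, `Parch`). The rows below are made up for illustration and are not real Titanic records, so the numbers they produce say nothing about the actual dataset. The helper `survival_rate_by` and the derived `FamilySize` feature are hypothetical names introduced here.

```python
from collections import defaultdict

# Hypothetical training rows; values are invented for illustration only.
train = [
    {"Survived": 1, "Pclass": 1, "SibSp": 1, "Parch": 0},
    {"Survived": 1, "Pclass": 1, "SibSp": 0, "Parch": 0},
    {"Survived": 0, "Pclass": 2, "SibSp": 0, "Parch": 2},
    {"Survived": 1, "Pclass": 2, "SibSp": 1, "Parch": 1},
    {"Survived": 0, "Pclass": 3, "SibSp": 0, "Parch": 0},
    {"Survived": 0, "Pclass": 3, "SibSp": 3, "Parch": 1},
]


def survival_rate_by(rows, feature):
    """Return {feature value: fraction of rows with Survived == 1}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [survivors, total]
    for row in rows:
        counts[row[feature]][0] += row["Survived"]
        counts[row[feature]][1] += 1
    return {value: s / n for value, (s, n) in sorted(counts.items())}


# Predictive power: does the survival rate differ across Pclass levels?
print(survival_rate_by(train, "Pclass"))

# Feature engineering: derive a new (hypothetical) feature from existing
# columns, then check it the same way.
for row in train:
    row["FamilySize"] = row["SibSp"] + row["Parch"] + 1
print(survival_rate_by(train, "FamilySize"))
```

If the rate varies noticeably between levels of a feature, that feature is a candidate predictor. A flat rate suggests it may add little on its own.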

EDA is a critical step in crafting the most valuable machine learning models. Be prepared to invest time in this project stage—there are rewards to be had!

Profiling the Titanic training data

EDA can start immediately as the Titanic dataset is already split into training and test sets. The following code illustrates profiling the Titanic training data. Run the code and examine the output.
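As a rough stand-in for a profiling pass, here is a minimal Python sketch that uses only the standard library. It is not the course's own listing. The inline sample follows a subset of the Titanic column layout, but its values are made up, and the helpers `is_number` and `profile` are hypothetical names. To profile the real training file, the `io.StringIO(SAMPLE_CSV)` argument would be swapped for an open file handle (assuming the file is saved locally, for example as `train.csv`).

```python
import csv
import io
from collections import Counter

# Made-up sample in the Titanic column layout (subset of columns); not real records.
SAMPLE_CSV = """Survived,Pclass,Sex,Age,Fare,Embarked
0,3,male,22,7.25,S
1,1,female,38,71.28,C
1,3,female,,7.92,S
0,2,male,35,13.0,
"""


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile(rows):
    """Summarize each column: missing count, then range or level counts."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values if v != ""]  # empty string = missing
        info = {"missing": len(values) - len(present)}
        if present and all(is_number(v) for v in present):
            numbers = [float(v) for v in present]
            info["type"] = "numeric"
            info["min"], info["max"] = min(numbers), max(numbers)
        else:
            info["type"] = "categorical"
            info["levels"] = dict(Counter(present))
        summary[column] = info
    return summary


train_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train_profile = profile(train_rows)
for column, info in train_profile.items():
    print(column, info)
```

Note that this simple type check labels `Survived` and `Pclass` as numeric because they are stored as digits, even though they behave like categories. Spotting that kind of mismatch is part of what profiling is for.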
