Exploratory Data Analysis—Profiling
Learn how to profile data for training machine learning models.
The first steps of machine learning
Once a machine learning project starts, the first step is to split the data into training and test sets. Splitting the dataset is discussed later in the course.
The second step is to perform exploratory data analysis (EDA) only on the training data.
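The two steps above can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the toy dataset, the `split_train_test` helper, and the 80/20 ratio are all assumptions made for the example, and it uses only the Python standard library. In practice, a library function such as scikit-learn's `train_test_split` would typically handle the split.

```python
import random
import statistics

# Hypothetical toy dataset: each row is (feature value, label).
random.seed(0)
rows = [(random.gauss(50, 10), random.randint(0, 1)) for _ in range(100)]


def split_train_test(data, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Step 1: split the data first.
train, test = split_train_test(rows)

# Step 2: profile only the training rows; the test rows are never inspected.
train_feature = [x for x, _ in train]
profile = {
    "count": len(train_feature),
    "mean": statistics.mean(train_feature),
    "stdev": statistics.stdev(train_feature),
    "min": min(train_feature),
    "max": max(train_feature),
}
print(profile)
```

The test rows are set aside right after the split and are not used for any of the summary statistics.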
A frequent error among new machine learning practitioners is to perform EDA on all the data before making the training/test split. Doing so can lead the model to make low-quality predictions on new, unseen data; in other words, the model overfits.
This type of overfitting is caused by information leakage, which happens when test set information influences the training of a model. Strict adherence to machine learning best practices helps avoid information leakage. Information leakage will be covered later in the course ...
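As one hedged illustration of how test set information can influence training, consider a statistic that is later used to prepare the data, such as a mean used for centering. The values and the `center` helper below are hypothetical, and this is only one of several ways leakage can arise.

```python
import statistics

# Hypothetical feature values, already split into training and test sets.
train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [100.0, 200.0]

# Leaky: the mean is computed from all the data, so test values shape it.
leaky_mean = statistics.mean(train_values + test_values)

# Safe: the mean is computed from the training data only.
train_mean = statistics.mean(train_values)


def center(values, mean):
    """Subtract a previously computed mean from every value."""
    return [v - mean for v in values]


# The training-only mean is reused for the test set rather than recomputed.
train_centered = center(train_values, train_mean)
test_centered = center(test_values, train_mean)

print(leaky_mean, train_mean)
```

Here the leaky mean is pulled far from the training mean by the two test values, so anything built on it has already seen the test set.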