Exercise: Exploring and Cleaning the Data

Learn to explore and clean the data for predictive modeling.

So far, we have identified one data quality issue, related to the metadata: we had been told that every sample in our dataset corresponded to a unique account ID, but found that this was not the case. We used logical indexing in pandas to correct it. This was a fundamental issue, concerning simply which samples were present. Beyond that, we are not interested in the account ID column itself: it is metadata and will not help us develop a predictive model for credit default.
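As a reminder of what that kind of check can look like, here is a minimal sketch on toy data. The column names, the values, and the choice to drop every row whose ID is repeated are assumptions made for illustration; they are not necessarily what the previous exercise did.

```python
import pandas as pd

# Toy stand-in data; the column names 'ID' and 'LIMIT' are hypothetical.
toy = pd.DataFrame({
    'ID': ['a1', 'b2', 'b2', 'c3', 'c3', 'd4'],
    'LIMIT': [100, 200, 200, 300, 0, 400],
})

# Count how many times each ID occurs.
id_counts = toy['ID'].value_counts()

# Logical (boolean) indexing: keep only the IDs that occur more than once.
dupe_ids = id_counts[id_counts > 1].index

# Boolean mask over the rows: True where the row's ID is duplicated.
is_dupe_row = toy['ID'].isin(dupe_ids)

# One possible policy (an assumption): drop every row with a repeated ID.
toy_unique = toy.loc[~is_dupe_row].copy()

print(toy_unique)
```

The same boolean mask can be used to inspect the duplicated rows first (toy.loc[is_dupe_row]) before deciding how to handle them.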

Examining features and data quality

Now, we are ready to start examining the values of the features and response variable, the data we will use to develop our predictive model. Perform the following steps to complete this exercise:

  1. Load the results of the previous exercise and obtain the data types of the columns by using the info() method, as shown below:

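    import pandas as pd
    # Load the cleaned data saved at the end of the previous exercise
    df_clean_1 = pd.read_csv('df_clean_1.csv')
    # Print column names, non-null counts, and data types
    df_clean_1.info()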
    

    You should see the following output:

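The exact output depends on the contents of df_clean_1.csv. As a rough guide to reading it, the sketch below runs the same inspection on a small stand-in CSV and then lists the columns that pandas did not parse as numeric. The column names and values are hypothetical and are not taken from the exercise data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; not the exercise data.
csv_text = """account,balance,status
a1,20000,2
b2,120000,Not available
c3,90000,0
"""

toy = pd.read_csv(io.StringIO(csv_text))

# Same call as in the exercise: column names, non-null counts, dtypes.
toy.info()

# Columns that were not parsed as numbers. A column expected to be numeric
# that shows up here usually contains at least one non-numeric entry.
non_numeric = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
]
print(non_numeric)
```

Here the status column mixes numbers with a text entry, so pandas cannot store it as a numeric type. Spotting that kind of mismatch is one of the main reasons to check data types before modeling.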