Exercise: Exploring and Cleaning the Data

Learn to explore and clean the data for predictive modeling.

Thus far, we have identified a data quality issue related to the metadata: we had been told that every sample in our dataset corresponded to a unique account ID, but found that this was not the case. We used logical indexing and pandas to correct this issue. This was a fundamental data quality issue, because it concerned which samples were present at all, as determined from the metadata. Beyond this, we are not really interested in the metadata column of account IDs: these will not help us develop a predictive model for credit default.
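As a reminder of what logical indexing looks like, here is a minimal sketch on toy data. The column names, values, and the choice to drop every row with a repeated ID are assumptions made for illustration only; they are not the actual steps or data from the previous exercise.

```python
import pandas as pd

# Hypothetical toy data: the ID 'b2' appears twice.
toy = pd.DataFrame({
    'ID': ['a1', 'b2', 'b2', 'c3'],
    'LIMIT_BAL': [20000, 0, 0, 50000],
})

# Count how often each ID occurs and collect the IDs seen more than once.
id_counts = toy['ID'].value_counts()
dupe_ids = id_counts.index[id_counts > 1]

# Boolean mask: True for rows whose ID is duplicated.
dupe_mask = toy['ID'].isin(dupe_ids)

# Logical indexing: keep only the rows where the mask is False.
toy_unique = toy.loc[~dupe_mask].copy()
print(toy_unique)
```

The mask has one True or False entry per row, so `toy.loc[~dupe_mask]` selects exactly the rows whose ID occurs once. Whether duplicated rows should be dropped entirely or repaired depends on why they are duplicated.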

Examining features and data quality

Now, we are ready to start examining the values of the features and response variable, the data we will use to develop our predictive model. Perform the following steps to complete this exercise:

  1. Load the results of the previous exercise and obtain the data types of the columns by using the info() method, as shown below:

    import pandas as pd
    df_clean_1 = pd.read_csv('df_clean_1.csv')
    df_clean_1.info()
    

    You should see the following output:

Figure: Getting columns metadata

We can see in the figure above that there are 25 columns. According to this summary, each column has 29,685 non-null values, which is the number of rows in the DataFrame. This would indicate that there is no missing data, in the sense that each cell contains some value. However, if a fill value has been used to represent missing data, that would not be evident here.
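To see why a fill value can hide missing data from this kind of summary, consider the following sketch. It uses a small made-up DataFrame, not the case study data; the column name `FEATURE_X` and the placeholder string `'unknown'` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'unknown' stands in for a missing value.
toy = pd.DataFrame({
    'AGE': [24, 26, 34],
    'FEATURE_X': ['0', 'unknown', '2'],
})

# Non-null counts per column, the same quantity that info() reports.
# Every cell holds some value, so each count equals the number of rows.
print(toy.notnull().sum())

# Counting the placeholder explicitly reveals the hidden missing entry.
n_placeholder = (toy['FEATURE_X'] == 'unknown').sum()
print(n_placeholder)

# After converting the placeholder to NaN, the non-null count drops.
feature_with_nan = toy['FEATURE_X'].replace('unknown', np.nan)
print(feature_with_nan.notnull().sum())
```

A non-null count equal to the number of rows only shows that no cell is empty. Confirming that no placeholder is present requires looking at the values themselves.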

We also see that most columns say int64 next to them, indicating they are an integer data type, ...