Exploratory Data Analysis
Perform the exploratory data analysis for our project.
Let’s do the EDA of the CKD dataset. We start with a concise summary of the CKD data that we loaded in the previous lesson (that is, ckd).
print(ckd.info())
Let’s also get the individual class counts.
print(ckd['class'].value_counts())
In addition, let’s get the summary statistics, excluding NaN values.
print(ckd.describe())
According to the concise summary, several columns contain missing values.
Missing data and class imbalance
There are missing values in the dataset.
It’s better to look at them as a percentage of each column and to choose a strategy based on domain knowledge. In clinical datasets, a value is often missing because the test was not required for some stated reason, for example, the patient did not meet a certain threshold. Patterns in the missing data are therefore informative, and it’s essential to explore the reasons behind them.
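One way to look for such patterns is to compare the share of missing values across classes and to count how often two columns are missing in the same row. The following is a minimal sketch on a small made-up frame; the toy DataFrame and its values are assumptions for illustration only, and with the real data the same calls would be applied to ckd.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ckd DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    "rbc":   [np.nan, "normal", np.nan, "abnormal", np.nan, "normal"],
    "rbcc":  [np.nan, 5.2, np.nan, 3.9, 4.8, np.nan],
    "class": ["notckd", "ckd", "notckd", "ckd", "notckd", "ckd"],
})

# Boolean mask of missing entries for the feature columns only.
is_missing = toy.drop(columns="class").isnull()

# Percentage of missing values per column, split by class.
missing_by_class = is_missing.groupby(toy["class"]).mean() * 100
print(missing_by_class)

# Number of rows in which both columns are missing at the same time.
both_missing = int((is_missing["rbc"] & is_missing["rbcc"]).sum())
print(both_missing)
```

If one class accounts for most of the gaps in a column, the missingness itself carries information about the target, which is worth knowing before choosing an imputation strategy.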
We can also see that the classes are not balanced.
Out of 400 observations, we have 250 ckd and 150 notckd. The ckd:notckd ratio is 5:3, and notckd observations are around 37.5% of the total data. This imbalance is manageable and should not be a big issue if we can fix the missing data. A good start is splitting the data for training and testing with a 60:40 ratio. Later, we can think about oversampling the minority class using the algorithms we have discussed in the course.
EDA is vital, and we also want to check the distribution of each variable in our dataset. A variable may take one specific value in the overwhelming majority of rows, in which case it may not be helpful for predictions. We also notice from the dataset that several variables have blood in their name. Is there any relationship between them? If a specific set of features is correlated, we might consider including interaction terms for them or removing one of them while building a linear model. Let’s start by checking the percentage of missing data.
print("% of missing data in each column -- Highest first:\n")
print((ckd.isnull().sum() / len(ckd) * 100).sort_values(ascending=False))
Every column other than class, which is the target, has some missing data. In particular, the rbc, wbcc, and rbcc columns are each missing more than 25% of their values. Typically, any column with over 10% or 15% missing data is a concern. It should be addressed to avoid ...