Exploratory Data Analysis

Perform the exploratory data analysis for our project.

We'll cover the following...

Missing data and class imbalance
Possible reasons for missing data
Techniques to deal with the missing data
Complete case analysis

The concise summary of the data shows that some values are missing.

Missing data and class imbalance

There are missing values in the dataset.

It’s more informative to view the missing values as a percentage of each variable and to choose a strategy based on domain knowledge. In clinical datasets, a value is often missing because a test was not required for a stated reason, for example, because the patient did not meet a certain threshold. Patterns in the missing data are therefore informative, and it’s essential to explore the reasons behind them.

We can also see that the classes are not balanced.

Out of 400 observations, 250 are ckd and 150 are notckd. The ckd:notckd ratio is 5:3, so notckd makes up 37.5% of the data. This imbalance is manageable and should not be a big issue once the missing data is handled. A good start is splitting the data into training and test sets with a 60:40 ratio. Later, we can consider oversampling the minority class using the algorithms discussed in the course.
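As a rough sketch of the class counts and the 60:40 split, the snippet below builds a stand-in label column with the same 250/150 counts rather than loading the real dataset. The variable names, the random seed, and the choice to stratify the split by class are assumptions made for illustration, not necessarily what the project does.

```python
import pandas as pd

# Hypothetical stand-in for the target column: 250 "ckd" and 150 "notckd" labels.
labels = pd.Series(["ckd"] * 250 + ["notckd"] * 150, name="class")

# Absolute counts and percentage share of each class.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True) * 100
print(counts.to_dict())
print(shares.round(1).to_dict())

# Stratified 60:40 split (an assumption): draw 40% of each class for testing
# and keep the remaining rows for training, so both parts keep the 5:3 ratio.
test_idx = labels.groupby(labels).sample(frac=0.4, random_state=42).index
test_labels = labels.loc[test_idx]
train_labels = labels.drop(test_idx)

print(len(train_labels), len(test_labels))
print(train_labels.value_counts().to_dict())
```

Sampling within each class keeps the minority share the same in both parts, which matters more as the imbalance grows.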

EDA is vital, and we also want to check the distribution of each variable in our dataset. A variable may take one specific value in the overwhelming majority of observations, which makes it of little use for prediction. We also notice that several variables have blood in their name. Is there any relationship between them? If a set of features is correlated, we might include interaction terms for them or remove one of them when building a linear model. Let’s start by checking the percentage of missing data.
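The checks described above could look roughly like the following sketch. It uses a tiny made-up DataFrame, so the column names and values are hypothetical and do not come from the project's data. It computes the percentage of missing values per column, the share of the most frequent value in each column, and the correlation between the numeric columns with blood in their name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "blood_pressure": [80, 70, np.nan, 90, 80, np.nan, 70, 80],
    "blood_urea": [36, np.nan, 53, 56, 26, 25, 54, 31],
    "red_blood_cells": ["normal", None, None, "normal", "normal",
                        None, "normal", "abnormal"],
    "class": ["ckd"] * 5 + ["notckd"] * 3,
})

# Percentage of missing values per column, largest first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Share of the most frequent value in each column (missing values ignored).
# A share close to 1.0 flags a variable that is almost constant.
top_share = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(top_share)

# Pairwise correlation between the numeric columns with "blood" in their name.
# corr() uses only the rows where both columns are present.
blood_cols = [c for c in df.select_dtypes("number").columns if "blood" in c]
blood_corr = df[blood_cols].corr()
print(blood_corr)
```

On the real data, the same three lines of analysis apply once `df` is replaced by the loaded dataset. Categorical columns would need encoding before they can enter the correlation matrix.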
