Imputation of Missing Values

Missing data is a common problem in real-world datasets. It can arise for a variety of reasons, such as measurement errors, data corruption, or simply because the data was never collected. When working with such datasets, we need to handle missing values appropriately, since many machine learning (ML) algorithms cannot handle missing data.

Imputation is the process of filling in missing values with estimated values based on the available data. This can be a challenging task because it requires us to carefully consider the type of missing data, the nature of the dataset, and the problem we are trying to solve.

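Imputation of missing values can be helpful for several reasons. Compared with simply discarding incomplete rows or columns, filling in missing values lets us keep the information in the values that were observed and can reduce the bias that discarding data may introduce. It can also improve the accuracy of our models and lead to better insights, provided the imputation strategy suits the data.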

We’ll explore the different strategies for imputing missing values in scikit-learn, such as simple imputation, iterative imputation, and KNN imputation. By the end, we’ll have a thorough understanding of how to handle missing data in our datasets.
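As a quick preview, the three strategies named above correspond to the `SimpleImputer`, `IterativeImputer`, and `KNNImputer` classes in `sklearn.impute`. The sketch below is only an illustration: the small array `X` and the parameter choices (`strategy="mean"`, `n_neighbors=2`, `random_state=0`) are assumptions made for the example, not recommendations.

```python
import numpy as np

# IterativeImputer is still experimental in scikit-learn, so this import
# is needed to enable it before importing the class itself.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer

# Hypothetical toy data with two missing entries (np.nan).
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [5.0, 6.0],
    [7.0, np.nan],
])

imputers = {
    "simple": SimpleImputer(strategy="mean"),     # column mean
    "iterative": IterativeImputer(random_state=0),  # models each feature from the others
    "knn": KNNImputer(n_neighbors=2),             # average of the nearest rows
}

# Fit each imputer on X and fill in the missing entries.
results = {name: imputer.fit_transform(X) for name, imputer in imputers.items()}

for name, filled in results.items():
    print(name)
    print(filled)
```

All three imputers share the same `fit` / `transform` interface, so one can be swapped for another without changing the surrounding code.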

In the image below, we can see how imputation works by replacing invalid values such as “NaN” and “?” with valid values. The question we’ll answer here is how to determine the new values for this missing data.
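To make the idea concrete, here is a minimal sketch of that replacement in code. The DataFrame, its column names (`age`, `income`), and the choice of the median as the fill value are all assumptions for illustration. Placeholder tokens such as “?” first have to be converted to a proper missing-value marker (`NaN`) before an imputer can fill them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one value is NaN, another is recorded as "?".
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": ["50000", "?", "62000", "58000"],
})

# Convert the text column to numbers; any non-numeric token,
# including "?", becomes NaN.
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# Replace every NaN with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(filled)
```

The median is used here only as one possible answer to the question of how to choose the new values; other strategies may suit other data better.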
