Handling missing values is an essential part of data cleaning and preprocessing in machine learning tasks. Many techniques can be used to handle missing data, depending on the specific use case and the nature of the missingness in the dataset.
Let's take a look at a few methods that can be used to handle missing data: deleting missing values, imputing the most frequent value, replacing with a predicted value, multiple imputation, and imputing with nearest neighbors.
Deleting the missing values is the quickest approach, but not the preferred one, because there is a high chance of deleting important data. Generally, if a value is missing not at random (MNAR), we should not delete the data; if it is missing completely at random (MCAR) or missing at random (MAR), we can consider deleting it. There are two ways to delete data containing missing values.
Delete the row with a missing value(s): This is a listwise deletion process in which we remove from the dataset every row that contains missing values. However, it is important to know that if every row has some missing data, we might delete the whole dataset. We can use myDataFrame.dropna() and pass axis=0 as a parameter, as sketched below.
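Here is a minimal sketch of row-wise deletion, assuming a small hypothetical DataFrame (the column names and values are made up for illustration):

import pandas as pd
import numpy as np

# Hypothetical data: the second row contains a missing value
df = pd.DataFrame({'Age': [22, np.nan, 35], 'Fare': [7.25, 71.28, 8.05]})

# axis=0 drops every row that contains at least one NaN
print(df.dropna(axis=0))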
Delete the column with a missing value(s): This is a variable-wise deletion process in which we remove from the dataset an entire column that contains many missing values. However, it is important to know that if every column has some missing values, we might delete crucial variables and their corresponding observations, which might be needed for obtaining correct results. We can use myDataFrame.drop() and pass ['Dependents'] and axis=1 as parameters to drop a specific column by name, or use myDataFrame.dropna(axis=1) to drop every column containing missing values, as sketched below.
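A minimal sketch of column-wise deletion, again with a hypothetical DataFrame (the 'Dependents' column is a made-up example):

import pandas as pd
import numpy as np

# Hypothetical data: the 'Dependents' column contains a missing value
df = pd.DataFrame({'Dependents': [0, np.nan, 2], 'Fare': [7.25, 71.28, 8.05]})

# Drop a specific column by name
print(df.drop(['Dependents'], axis=1))

# Alternatively, drop every column that contains at least one NaN
print(df.dropna(axis=1))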
Imputing the most frequent value handles missing values in categorical features. We check the whole dataset and replace the missing cells with the mode, i.e., the most frequent value. We import SimpleImputer from the sklearn.impute module and pass strategy='most_frequent' to it as a parameter. This method is mostly used when dealing with non-numerical values.
Let's take an example of a dataset that contains types of shapes but has a nan value in it. Let's print it as is, and you will see that the missing value is printed as NaN.
import pandas as pd
import numpy as np

X = pd.DataFrame({'Shape': ['circle', 'square', 'rectangle', 'circle', np.nan, 'oval']})
print(X)
Now let's import SimpleImputer from the sklearn.impute module and apply the fit_transform() function to replace the missing value with the most frequently used one. You will notice that NaN is replaced with the most frequently occurring value, which is circle in this case.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

X = pd.DataFrame({'Shape': ['circle', 'square', 'rectangle', 'circle', np.nan, 'oval']})

imputer = SimpleImputer(strategy='most_frequent')
print(imputer.fit_transform(X))
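Note that fit_transform() returns a NumPy array rather than a DataFrame, so the column labels are dropped in the printed output.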
Replacing with a predicted value means predicting the missing value from the existing data and then filling it in. There are two sub-approaches for doing this.
Replace with a univariate statistic: In this approach, we only consider a single feature from the dataset. We import SimpleImputer from sklearn.impute and replace each missing value with the mean of its column.
Let's take a look at the example below and see how the missing values are replaced.
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 4], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
Line 4: Create an instance of SimpleImputer and set the strategy to 'mean', which means that we replace all the missing values with the mean of their column.
Line 5: Use the fit() method on the imputer to calculate the mean of each column. The calculated values are stored in the imputer's internal state.
Lines 7–8: Create a 2D array, use transform() to replace all the missing values in it with the mean of the corresponding column, and print the results.
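For reference, fit() learns column means of 4 (from 1 and 7) and 4 (from 2, 4, and 6), so the printed result should be [[4. 2.] [6. 4.] [7. 6.]].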
Replace with a multivariate statistic: In this approach, we consider more than one feature from the dataset. We import IterativeImputer from sklearn.impute and predict each missing value from the other variables, so one variable can influence the value estimated for another.
Let's take a look at the example below, in which we use the Titanic dataset to see how the missing values are replaced. The Age value is missing in the sixth row of the original dataset, but we predict it with the regression model that is built.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]

impute_it = IterativeImputer()

print(impute_it.fit_transform(X))
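Note that IterativeImputer is still marked experimental in scikit-learn, which is why the enable_iterative_imputer import must appear before IterativeImputer can be imported.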
Line 5: Create a DataFrame df by reading the first 7 rows of the Titanic dataset from the given URL.
Lines 6–7: Select the columns to be considered and store them in a new DataFrame X. In this case, we choose SibSp (the number of siblings or spouses aboard), Fare (the ticket fare), and Age.
Line 9: Create an instance of IterativeImputer that will build a regression model with the SibSp and Fare variables and then make predictions for the Age column based on it.
Line 11: Use fit_transform() to replace the missing value in the Age column with the value predicted by the regression model, and print the results.
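The multiple imputation method mentioned at the start builds on this same machinery: instead of filling each missing cell with a single point estimate, we draw several plausible values and compare or pool the results. Here is a minimal sketch, assuming scikit-learn's sample_posterior parameter of IterativeImputer; the loop and the seeds are illustrative choices, not part of the example above:

import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
X = df[['SibSp', 'Fare', 'Age']]

# Each run samples the imputed value from the posterior distribution,
# so the runs differ; pooling them is the essence of multiple imputation
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    print(imputer.fit_transform(X)[5])  # the row whose Age was missing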
Imputing with nearest neighbors is also a multivariate approach, where the missing value is predicted from the other variables in the dataset. In this case, we apply the Euclidean distance to find the rows nearest to the one with the missing value. We import KNNImputer from sklearn.impute and replace the missing value with the average of the Age values of the two closest rows.
Let's take a look at the example below, which uses the same Titanic dataset and columns as the example above. Now let's find the missing Age in the sixth row using KNNImputer.
from sklearn.impute import KNNImputer
import pandas as pd

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]

impute_knn = KNNImputer(n_neighbors=2)

print(impute_knn.fit_transform(X))
Line 4: Create a DataFrame df by reading the first 7 rows of the Titanic dataset from the given URL.
Lines 5–6: Select the columns to be considered and store them in a new DataFrame X. In this case, we choose SibSp (the number of siblings or spouses aboard), Fare (the ticket fare), and Age.
Line 8: Create an instance of KNNImputer and specify the number of neighbors to find. Here, we find the two rows whose SibSp and Fare values are closest (by Euclidean distance) to those of the 6th row, which has the missing value. In this case, the 3rd and 5th rows are the closest, so the average of their Age values (26 and 35, giving 30.5) is used to replace the missing value.
Line 10: Use fit_transform() to replace the missing value in the Age column with the average Age of the specified number of neighbors, and print the results.
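To see how these neighbors are chosen, here is a small optional sketch (not part of the original example) that computes the same NaN-aware Euclidean distances that KNNImputer uses internally, via scikit-learn's nan_euclidean_distances:

import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
X = df[['SibSp', 'Fare', 'Age']]

# Distance from the 6th row (index 5, missing Age) to every row;
# the NaN coordinate is skipped, matching KNNImputer's default metric
print(nan_euclidean_distances(X.iloc[[5]], X))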
There are various ways to handle missing values in datasets, each with its own pros and cons. It is critical to handle missing values carefully to minimize potential biases in machine learning models and to achieve more precise and accurate results.
To summarize the three imputers:
SimpleImputer: It considers a single feature because the prediction made for it is independent of the other variables.
IterativeImputer: It considers more than one feature because the other variables can influence the missing value prediction.
KNNImputer: It considers more than one feature because the nearest neighbors are found using the other variables.