How to handle missing values in machine learning

Handling missing values is an essential aspect of data cleaning and preprocessing in machine learning tasks. Many techniques can be used to handle missing data, depending on the specific use case and on how sensitive the results are to the missing values.

Let's take a look at a few methods that can be used to handle missing data, including deleting missing values, imputing the most frequent value, replacing with a predicted value, and imputing with nearest neighbors.

Method 1: Delete the missing values

This approach is the quickest but not the preferred one because there is a high chance of deleting important data. Generally, if the values are missing not at random (MNAR), we should not delete the data; if they are missing completely at random (MCAR) or missing at random (MAR), deletion may be acceptable. There are two ways to delete the data containing missing values.
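Before deleting anything, it helps to quantify how much data is missing per column, so we can judge how costly deletion would be. The sketch below uses a small hypothetical DataFrame for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical DataFrame with some missing values
df = pd.DataFrame({
    'Age': [22, np.nan, 35, 29],
    'Fare': [7.25, 71.28, np.nan, np.nan],
})

# Count missing values per column to gauge what deletion would cost
print(df.isna().sum())
# Age     1
# Fare    2
```

If a column is mostly missing, dropping the column may be reasonable; if only a few scattered rows are affected, dropping rows loses less information.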

  • Delete the row with a missing value(s): This is a listwise deletion process in which we remove from the dataset every row that contains missing values. However, it is important to know that if all the rows have some missing data, we might delete the whole dataset. We can use myDataFrame.dropna() and pass axis=0 as a parameter.

Delete the row with missing values.
  • Delete the column with a missing value(s): This is a variable-wise deletion process in which we remove from the dataset an entire column that contains a lot of missing values. However, it is important to know that if all the columns have some missing values, we might delete variables that are crucial for obtaining correct results. We can use myDataFrame.drop() and pass ['Dependents'] and axis=1 as parameters.

Delete the column with missing values.
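The two deletion strategies above can be sketched as follows, assuming a small hypothetical DataFrame with a 'Dependents' column that contains missing values.

```python
import pandas as pd
import numpy as np

# Hypothetical data: 'Dependents' has two missing entries
df = pd.DataFrame({
    'Income': [50000, 62000, np.nan, 48000],
    'Dependents': [2, np.nan, 1, np.nan],
})

# Listwise deletion: drop every row containing at least one missing value
rows_dropped = df.dropna(axis=0)
print(rows_dropped.shape)  # (1, 2) -- only the first row is complete

# Variable-wise deletion: drop the 'Dependents' column entirely
cols_dropped = df.drop(['Dependents'], axis=1)
print(cols_dropped.shape)  # (4, 1)
```

Note how aggressive row deletion can be here: three of the four rows disappear, which is why deletion is rarely the preferred option.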

Method 2: Impute the most frequent value

This approach imputes missing values for categorical features. We replace each missing cell with the mode (the most frequent value) of its column. We import SimpleImputer from the sklearn.impute module and pass strategy='most_frequent' to it as a parameter. It is mostly used when dealing with non-numerical values.

Let's take an example of a dataset that contains types of shapes but has a NaN value in it. If we print it as is, the missing value appears as NaN.

import pandas as pd
import numpy as np
X = pd.DataFrame({'Shape':['circle', 'square', 'rectangle', 'circle', np.nan, 'oval']})
print(X)

Now let's import SimpleImputer from the impute module and apply fit_transform to replace the missing value with the most frequent value. You will notice that NaN is replaced with 'circle', the most frequently occurring value in this case.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
X = pd.DataFrame({'Shape':['circle', 'square', 'rectangle', 'circle', np.nan, 'oval']})
imputer = SimpleImputer(strategy='most_frequent')
print(imputer.fit_transform(X))

Method 3: Replace with a predicted value

This approach predicts the missing value based on the existing data and then replaces the missing values with it. There are two further sub-approaches through which we can do this.

  • Replace with a univariate statistic: In this approach, we consider only a single feature at a time. We import SimpleImputer from sklearn.impute and replace each missing value with the mean of its column.

Let's take a look at the example below and see how the missing values are replaced.

import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 4], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
  • Line 3: Create an instance of SimpleImputer and set the strategy as mean, which means that we replace all the missing values with the mean of their column.

  • Line 4: Use the fit() method on the imputer to calculate the mean of each column. The calculated values are stored in the imputer's internal state.

  • Lines 5–6: Create a 2D array, use transform() to replace all its missing values with the mean of the corresponding column, and print the results.

  • Replace with a multivariate statistic: In this approach, we consider more than one feature from the dataset. We import IterativeImputer from sklearn.impute (enabling it first via sklearn.experimental, as it is still experimental) and predict each missing value from the other variables. In this case, one variable can influence the predicted values for another.

Let's take a look at the example below, in which we use the Titanic dataset to see how the missing values are replaced. The Age is missing in the sixth row of the original dataset, so we predict it with the regression model that IterativeImputer builds.

import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]
impute_it = IterativeImputer()
print(impute_it.fit_transform(X))
  • Line 4: Create a DataFrame instance df by reading the first 7 rows of the Titanic dataset from the given URL.

  • Lines 5–6: Select the columns from the dataset to be considered and store them in a new DataFrame X. In this case, we choose SibSp (the number of siblings or spouses aboard), Fare (the ticket price), and Age.

  • Line 7: Create an instance of IterativeImputer that will build a regression model with the SibSp and Fare variables and then make predictions for the Age column based on it.

  • Line 8: Use fit_transform() to replace all the missing values in the Age column with the predicted values based on the regression model and print the results.

Method 4: Impute nearest neighbors

This approach is also a multivariate approach in which the missing value is predicted from the other variables in the dataset. In this case, we apply Euclidean distance to find the nearest neighbors of the row with the missing value. We import KNNImputer from sklearn.impute and replace the missing value with the average of the Age values of the two rows closest to the row with the missing value.

Let's take a look at the example below, which uses the same Titanic dataset and columns as the example above. Now let's fill in the missing Age in the sixth row using KNNImputer.

from sklearn.impute import KNNImputer
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]
impute_knn = KNNImputer(n_neighbors=2)
print(impute_knn.fit_transform(X))
  • Line 3: Create a DataFrame instance df by reading the first 7 rows of the Titanic dataset from the given URL.

  • Lines 4–5: Select the columns from the dataset to be considered and store them in a new DataFrame X. In this case, we choose SibSp (the number of siblings or spouses aboard), Fare (the ticket price), and Age.

  • Line 6: Create an instance of KNNImputer and specify the number of neighbors to find. Here, we find the two rows whose SibSp and Fare values are closest to those of the 6th row, which has a missing Age. In this case, the 3rd and 5th rows are the closest, so the average of the Age values in these two rows is used to replace the missing value.

  • Line 7: Use fit_transform() to replace the missing value in the Age column with the average Age of the specified number of nearest neighbors and print the results.
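To see the neighbor-averaging mechanics without downloading the Titanic dataset, here is a self-contained sketch with made-up numbers: the row with the missing Age receives the mean Age of its two nearest rows, measured on the non-missing features.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: columns are SibSp, Fare, Age; the last row has a missing Age
X = np.array([
    [0, 10.0, 20.0],
    [0, 11.0, 30.0],
    [1, 80.0, 40.0],
    [0, 10.5, np.nan],
])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X)

# The two nearest neighbors of the last row (by SibSp and Fare) are the
# first two rows, so the missing Age becomes (20 + 30) / 2 = 25
print(filled[-1, 2])  # 25.0
```

Because the distances are computed on raw feature values, features with large scales (such as Fare here) dominate the neighbor search, so scaling the features beforehand is often worth considering.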

Summary

There are various ways to handle missing values in datasets, each with its own pros and cons. It is critical to handle missing values carefully to minimize potential biases in machine learning models and achieve more precise and accurate results.

Test your understanding

Match The Answer
Select an option from the left-hand side

IterativeImputer

It considers a single feature because the prediction made for it is independent of the other variables.

KNNImputer

It considers more than one feature because the other variables can influence the missing value prediction.

SimpleImputer


Copyright ©2024 Educative, Inc. All rights reserved