Handling missing values is an essential part of data cleaning and preprocessing in machine learning tasks. Many techniques can be used to handle missing data, depending on the specific use case and the nature of the missingness in the dataset.
Let's take a look at a few methods that can be used to handle missing data: deleting missing values, imputing the most frequent value, replacing with a predicted value, multiple imputation, and imputing with nearest neighbors.
Deleting the missing values is the quickest approach, but not the preferred one, because there is a high chance of deleting important data. Generally, if a value is missing not at random (MNAR), we should not delete the data; if it is missing completely at random (MCAR) or missing at random (MAR), we can consider deleting it. There are two ways to delete data containing missing values.
Delete the row with a missing value(s): This is a listwise deletion process in which we remove from the dataset every row that contains missing values. However, it is important to know that if every row has some missing data, we might delete the whole dataset. We can use myDataFrame.dropna() and pass axis=0 as a parameter, as sketched below.
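Here is a minimal sketch of row-wise deletion, assuming a small hypothetical DataFrame (the column names and values are made up for illustration):

import pandas as pd
import numpy as np

# Hypothetical data: the second row contains a missing value
df = pd.DataFrame({'Age': [22, np.nan, 35], 'Fare': [7.25, 71.28, 8.05]})

# axis=0 drops every row that contains at least one NaN
print(df.dropna(axis=0))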
Delete the column with a missing value(s): This is a variable-wise deletion process in which we remove from the dataset an entire column that contains many missing values. However, it is important to know that if every column has some missing values, we might delete crucial variables and their corresponding observations, which might be needed for obtaining correct results. We can use myDataFrame.drop() and pass ['Dependents'] and axis=1 as parameters to drop a specific column by name, or use myDataFrame.dropna(axis=1) to drop every column containing missing values, as sketched below.
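A minimal sketch of column-wise deletion, again with a hypothetical DataFrame (the 'Dependents' column is a made-up example):

import pandas as pd
import numpy as np

# Hypothetical data: the 'Dependents' column contains a missing value
df = pd.DataFrame({'Dependents': [0, np.nan, 2], 'Fare': [7.25, 71.28, 8.05]})

# Drop a specific column by name
print(df.drop(['Dependents'], axis=1))

# Alternatively, drop every column that contains at least one NaN
print(df.dropna(axis=1))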
Imputing the most frequent value handles missing values in categorical features. We check the whole dataset and replace the missing cells with the mode, i.e., the most frequent value. We import SimpleImputer from the sklearn.impute module and pass strategy='most_frequent' to it as a parameter. This method is mostly used when dealing with non-numerical values.
Let's take an example of a dataset that contains types of shapes but has a nan value in it. Let's print it as is, and you will see that the missing value is printed as NaN.
import pandas as pd
import numpy as np

X = pd.DataFrame({'Shape': ['circle', 'square', 'rectangle', 'circle', np.nan, 'oval']})
print(X)
Now let's import SimpleImputer from the sklearn.impute module and apply the fit_transform() function to replace the missing value with the most frequently used one. You will notice that NaN is replaced with the most frequently occurring value, which is circle in this case.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

X = pd.DataFrame({'Shape': ['circle', 'square', 'rectangle', 'circle', np.nan, 'oval']})

imputer = SimpleImputer(strategy='most_frequent')
print(imputer.fit_transform(X))
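Note that fit_transform() returns a NumPy array rather than a DataFrame, so the column labels are dropped in the printed output.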
Replacing with a predicted value means predicting the missing value from the existing data and then filling it in. There are two sub-approaches for doing this.
Replace with a univariate statistic: In this approach, we only consider a single feature from the dataset. We import SimpleImputer from sklearn.impute and replace each missing value with the mean of its column.
Let's take a look at the example below and see how the missing values are replaced.
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 4], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
Line 4: Create an instance of SimpleImputer and set the strategy to 'mean', which means that we replace all the missing values with the mean of their column.
Line 5: Use the fit() method on the imputer to calculate the mean of each column. The calculated values are stored in the imputer's internal state.
Lines 7–8: Create a 2D array, use transform() to replace all the missing values in it with the mean of the corresponding column, and print the results.
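For reference, fit() learns column means of 4 (from 1 and 7) and 4 (from 2, 4, and 6), so the printed result should be [[4. 2.] [6. 4.] [7. 6.]].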
Replace with a multivariate statistic: In this approach, we consider more than one feature from the dataset. We import IterativeImputer from sklearn.impute and predict each missing value from the other variables, so one variable can influence the value estimated for another.
Let's take a look at the example below, in which we use the Titanic dataset to see how the missing values are replaced. The Age value is missing in the sixth row of the original dataset, but we predict it with the regression model that is built.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]

impute_it = IterativeImputer()

print(impute_it.fit_transform(X))
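Note that IterativeImputer is still marked experimental in scikit-learn, which is why the enable_iterative_imputer import must appear before IterativeImputer can be imported.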
Line 5: Create a DataFrame df by reading the first 7 rows of the Titanic dataset from the given URL.
Lines 6–7: Select the columns to be considered and store them in a new DataFrame X. In this case, we choose SibSp (the number of siblings or spouses aboard), Fare (the ticket fare), and Age.
Line 9: Create an instance of IterativeImputer that will build a regression model with the SibSp and Fare variables and then make predictions for the Age column based on it.
Line 11: Use fit_transform() to replace the missing value in the Age column with the value predicted by the regression model, and print the results.
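The multiple imputation method mentioned at the start builds on this same machinery: instead of filling each missing cell with a single point estimate, we draw several plausible values and compare or pool the results. Here is a minimal sketch, assuming scikit-learn's sample_posterior parameter of IterativeImputer; the loop and the seeds are illustrative choices, not part of the example above:

import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
X = df[['SibSp', 'Fare', 'Age']]

# Each run samples the imputed value from the posterior distribution,
# so the runs differ; pooling them is the essence of multiple imputation
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    print(imputer.fit_transform(X)[5])  # the row whose Age was missing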
Imputing with nearest neighbors is also a multivariate approach, where the missing value is predicted from the other variables in the dataset. In this case, we apply the Euclidean distance to find the rows nearest to the one with the missing value. We import KNNImputer from sklearn.impute and replace the missing value with the average of the Age values of the two closest rows.
Let's take a look at the example below, which uses the same Titanic dataset and columns as the example above. Now let's find the missing Age in the sixth row using KNNImputer.
from sklearn.impute import KNNImputer
import pandas as pd

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]

impute_knn = KNNImputer(n_neighbors=2)

print(impute_knn.fit_transform(X))
Line 4: Create a DataFrame df by reading the first 7 rows of the Titanic dataset from the given URL.
Lines 5–6: Select the columns to be considered and store them in a new DataFrame X. In this case, we choose SibSp (the number of siblings or spouses aboard), Fare (the ticket fare), and Age.
Line 8: Create an instance of KNNImputer and specify the number of neighbors to find. Here, we find the two rows whose SibSp and Fare values are closest (by Euclidean distance) to those of the 6th row, which has the missing value. In this case, the 3rd and 5th rows are the closest, so the average of their Age values (26 and 35, giving 30.5) is used to replace the missing value.
Line 10: Use fit_transform() to replace the missing value in the Age column with the average Age of the specified number of neighbors, and print the results.
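To see how these neighbors are chosen, here is a small optional sketch (not part of the original example) that computes the same NaN-aware Euclidean distances that KNNImputer uses internally, via scikit-learn's nan_euclidean_distances:

import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=7)
X = df[['SibSp', 'Fare', 'Age']]

# Distance from the 6th row (index 5, missing Age) to every row;
# the NaN coordinate is skipped, matching KNNImputer's default metric
print(nan_euclidean_distances(X.iloc[[5]], X))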
There are various ways to handle missing values in datasets, each with its own pros and cons. It is critical to handle missing values carefully to minimize potential biases in machine learning models and to achieve more precise and accurate results.
To summarize the three imputers:
SimpleImputer: It considers a single feature because the prediction made for it is independent of the other variables.
IterativeImputer: It considers more than one feature because the other variables can influence the missing value prediction.
KNNImputer: It considers more than one feature because the nearest neighbors are found using the other variables.