How to handle datasets with missing or corrupted data

Data preparation is an important step in the machine learning cycle and comes before exploring and analyzing data. Missing or corrupted values need to be handled so that the dataset is clean enough to build accurate models on or draw sound conclusions from. This article explores how missing or corrupted data can be handled, using the Python pandas module to demonstrate each of the methods highlighted below.

Handling missing data

If data is missing, you can use one of the following methods:

  1. Remove data: You can remove the rows with missing data (null or NaN values) from the dataset. This is appropriate when only a few rows have missing values and removing them does not change the dataset drastically. The disadvantage of this method is that you lose information. Below is an example of how you can do this using the dropna() function in pandas.
import pandas as pd
employee_record = {"Names":["Samia", "Sana", "Ali"], "Age":[10,20,30], "Salary":[20000, 45000, None ] }
# converting dictionary to dataframe
df = pd.DataFrame(employee_record)
# drop every row that contains at least one null value
final_dataset = df.dropna()
print("After dropping rows with null values:\n")
print(final_dataset)
  2. Impute with mean, median, or mode: The null values can be replaced by the mean, median, or mode of the column they belong to. The column with missing data must be numeric for the mean or median to be computed. Imputation preserves data, compared to the first method, where entire rows are deleted. However, the disadvantage is that it can bias the dataset and distort its variance. Below is an example of how you can impute values using the replace function in pandas.
import pandas as pd
import numpy as np
employee_record = {"Names":["Samia", "Sana", "Ali"], "Age":[10,20,30], "Salary":[20000, 45000, None ] }
df = pd.DataFrame(employee_record)
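# replace NaN in the Salary column with the column mean
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df['Salary'] = df['Salary'].replace(np.nan, df['Salary'].mean())
print("\n dataframe after replacement with mean of salary column")
print(df)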
  3. Impute with zero, constant, or recent value: The imputation in method 2 can also be adapted to work for categorical data. For categorical data, you can replace missing values with a constant value, a zero value, or the most frequent value in the column. However, this is not a very accurate approach and can also introduce variance and bias to the data. Below is an example of how you can do this using the replace function in pandas.
import pandas as pd
import numpy as np
employee_record = {"Names":["Samia", "Sana", "Ali"], "Age":[10,20,30], "Salary":[20000, 45000, None ] }
df = pd.DataFrame(employee_record)
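# replace NaN in the Salary column with the constant 0
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df['Salary'] = df['Salary'].replace(np.nan, 0)
print("\n dataframe after replacement with 0")
print(df)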
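The example above fills a numeric column with zero. For a categorical column, the same idea might look like the minimal sketch below. The Department column and its values are hypothetical and are not part of the original employee data; fillna() is used here as an alternative to replace().

```python
import pandas as pd

# Hypothetical data: "Department" is an assumed categorical column
employee_record = {
    "Names": ["Samia", "Sana", "Ali", "Omar"],
    "Department": ["Sales", None, "Sales", "HR"],
}
df = pd.DataFrame(employee_record)

# mode() ignores nulls and returns a Series (there can be ties),
# so take its first entry as the most frequent value
most_frequent = df["Department"].mode()[0]

# Option A: fill missing entries with the most frequent value
df["Department_mode"] = df["Department"].fillna(most_frequent)

# Option B: fill missing entries with a constant placeholder
df["Department_const"] = df["Department"].fillna("Unknown")

print(df)
```

A placeholder such as "Unknown" keeps the fact that a value was missing visible, whereas filling with the most frequent value hides it.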
  4. Impute using k-nearest neighbors: This method imputes each missing value from the k nearest neighbors of the row that contains it. It replaces the missing value with the average (optionally weighted by distance) of that feature across those neighbors. The k-nearest neighbors are found using the Euclidean distance between data points, computed on the features that are not missing. On large datasets, it can take a lot of time to apply the KNN machine learning algorithm to the data, calculate values for each of the missing values, and replace them. Below is an example of how you can do this using the sklearn module. First, we create a KNNImputer and set the number of neighbors it should take into account. Then, we fit the imputer to the data and transform the data with it.
import pandas as pd
from sklearn.impute import KNNImputer
employee_record = {"ID":[10, 20, 30,40,50,60], "Age":[10,20,30,20,20,20], "Salary":[20000, 45000, None ,20000,20000,40000] }
df = pd.DataFrame(employee_record)
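# use the single nearest neighbor to fill each missing value
imptr = KNNImputer(n_neighbors=1)
# fit_transform returns a NumPy array, so rebuild the dataframe
df = pd.DataFrame(imptr.fit_transform(df), columns=df.columns)
print("\n After KNN imputing:")
print(df)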
  5. Predicting values: You can predict the values for a feature with missing values by using linear regression on the rest of the features. Once predictions are generated for that feature, the missing values in the column can be replaced with them.
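The prediction approach in method 5 can be sketched as follows. This is only an illustration under stated assumptions: the data is made up so that Salary depends linearly on Age, Age is the only predictor, and LinearRegression from sklearn is one possible choice of model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data in which Salary = 1000 * Age + 10000
employee_record = {
    "Age": [10, 20, 30, 40, 50, 60],
    "Salary": [20000, 30000, None, 50000, 60000, 70000],
}
df = pd.DataFrame(employee_record)

feature_cols = ["Age"]  # assumed predictor column(s)

# Split the rows by whether the target value is present
known = df[df["Salary"].notna()]
missing = df[df["Salary"].isna()]

# Fit the regression on the rows where Salary is known
model = LinearRegression()
model.fit(known[feature_cols], known["Salary"])

# Replace only the missing entries with the predicted values
df.loc[missing.index, "Salary"] = model.predict(missing[feature_cols])

print("\n After imputing with linear regression:")
print(df)
```

The predictor columns must not contain missing values themselves, so in practice they may need to be imputed first with one of the simpler methods.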

Handling corrupted data

Data is corrupted when it is entered incorrectly into a dataset, e.g., an Age value that is outside the human range, or categorical data entered in a numerical column. To handle corrupted data, first detect which values are corrupt, and then apply the same methods described above for missing data.
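One way to detect such values is to convert them to missing values and then reuse the methods above. The sketch below is a minimal illustration: the sample data, the 0 to 120 age range, and the choice of median imputation are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical data: 230 is out of range and "thirty" is not numeric
employee_record = {
    "Names": ["Samia", "Sana", "Ali", "Omar"],
    "Age": [25, 230, "thirty", 41],
}
df = pd.DataFrame(employee_record)

# Entries that cannot be parsed as numbers become NaN
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Keep ages inside the assumed valid range; everything else becomes NaN
df["Age"] = df["Age"].where(df["Age"].between(0, 120))

# The corrupted entries are now ordinary missing values,
# so they can be imputed, here with the column median
df["Age"] = df["Age"].fillna(df["Age"].median())

print("\n After cleaning the Age column:")
print(df)
```

Marking corrupted entries as NaN first keeps detection separate from repair, so any of the missing-data methods can be applied afterwards.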

