In the machine learning cycle, data preparation is an important step toward exploring and analyzing data. Missing or corrupt data must be handled so that we have a clean dataset on which to build accurate models or from which to draw sound conclusions. We will explore how missing or corrupted data can be handled, using the Python pandas module to demonstrate each of the methods highlighted below.
If data is missing, one of the following methods can be used:
Remove rows with null or NaN values from the dataset. Removing data is appropriate when only a few rows have missing values and dropping them does not change the dataset drastically. The disadvantage of this method is that you lose information. Below is an example of how you can do this using the dropna() function in pandas.

    import pandas as pd

    employee_record = {
        "Names": ["Samia", "Sana", "Ali"],
        "Age": [10, 20, 30],
        "Salary": [20000, 45000, None],
    }

    # Convert the dictionary to a DataFrame
    df = pd.DataFrame(employee_record)
    print(df)

    # Drop every row that contains at least one null value
    final_dataset = df.dropna()
    print("After dropping rows with null values:\n")
    print(final_dataset)
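Before dropping rows, it can help to check how much data is actually missing. The sketch below reuses the employee_record example, counts missing values per column with isna(), and computes the fraction of rows that dropna() would remove. The MAX_DROP_FRACTION threshold of 0.05 is an arbitrary assumption chosen for illustration, not a recommended value.

```python
import pandas as pd

employee_record = {
    "Names": ["Samia", "Sana", "Ali"],
    "Age": [10, 20, 30],
    "Salary": [20000, 45000, None],
}
df = pd.DataFrame(employee_record)

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fraction of rows that contain at least one missing value
fraction_missing_rows = df.isna().any(axis=1).mean()
print(f"Rows with missing values: {fraction_missing_rows:.0%}")

# Hypothetical threshold: only drop rows when few of them are affected
MAX_DROP_FRACTION = 0.05
if fraction_missing_rows <= MAX_DROP_FRACTION:
    cleaned = df.dropna()
else:
    cleaned = df  # too much would be lost, so keep the rows for now
print(cleaned)
```

In this tiny dataset one row out of three has a missing value, so the rows are kept rather than dropped.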
Impute the missing data: null values can be replaced by a relevant mean, median, or mode value. This means that you calculate the mean, median, or mode of each feature and replace the missing values in a column with that statistic. The mean and median require the column to be of numeric type. Imputation preserves data, compared to the first method where entire rows are deleted. However, the disadvantage is that it can introduce bias and distort the variance of the column. Below is an example of how you can impute values using the replace function in pandas.

    import pandas as pd
    import numpy as np

    employee_record = {
        "Names": ["Samia", "Sana", "Ali"],
        "Age": [10, 20, 30],
        "Salary": [20000, 45000, None],
    }
    df = pd.DataFrame(employee_record)
    print(df)

    # Replace missing salaries with the mean of the Salary column
    # (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
    df["Salary"] = df["Salary"].replace(np.nan, df["Salary"].mean())
    print("\n dataframe after replacement with mean of salary column")
    print(df)
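The same idea extends to the median and the mode. The following is a minimal sketch that uses fillna() in place of replace; the Department column and the fourth employee are hypothetical additions made only so that there is a categorical column to impute.

```python
import pandas as pd

# Hypothetical data: Department and the fourth row are added for illustration
employee_record = {
    "Names": ["Samia", "Sana", "Ali", "Omar"],
    "Department": ["Sales", None, "Sales", "HR"],
    "Salary": [20000, 45000, None, 25000],
}
df = pd.DataFrame(employee_record)

# The median is less sensitive to extreme values than the mean
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

# mode() returns a Series because ties are possible, so take the first entry
df["Department"] = df["Department"].fillna(df["Department"].mode()[0])

print(df)
```

The mode is the only one of the three statistics that also applies to non-numeric columns such as Department.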
Missing values can also be replaced with a fixed constant, such as 0, using the replace function in pandas.

    import pandas as pd
    import numpy as np

    employee_record = {
        "Names": ["Samia", "Sana", "Ali"],
        "Age": [10, 20, 30],
        "Salary": [20000, 45000, None],
    }
    df = pd.DataFrame(employee_record)
    print(df)

    # Replace missing salaries with the constant 0
    df["Salary"] = df["Salary"].replace(np.nan, 0)
    print("\n dataframe after replacement with 0")
    print(df)
Missing values can also be predicted from similar rows using the KNNImputer class from the sklearn module. First, we create a KNNImputer and set the number of neighbors it should take into account. Then, we fit the imputer to the data and transform it.

    import pandas as pd
    from sklearn.impute import KNNImputer

    employee_record = {
        "ID": [10, 20, 30, 40, 50, 60],
        "Age": [10, 20, 30, 20, 20, 20],
        "Salary": [20000, 45000, None, 20000, 20000, 40000],
    }
    df = pd.DataFrame(employee_record)
    print(df)

    # Impute each missing value from its single nearest neighbor
    imptr = KNNImputer(n_neighbors=1)
    # fit_transform returns a NumPy array, so rebuild the DataFrame
    df = pd.DataFrame(imptr.fit_transform(df), columns=df.columns)
    print("\n After KNN imputing:")
    print(df)
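To see how the number of neighbors affects the result, the sketch below imputes the same missing salary with one and with two neighbors. The data and the impute_salary helper are hypothetical and chosen so that no two rows are equally close. With the default settings, KNNImputer averages the values of the selected neighbors.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: the row with Age 31 is missing its Salary
df = pd.DataFrame({
    "Age": [10, 20, 31, 50],
    "Salary": [20000, 30000, None, 50000],
})

def impute_salary(frame, k):
    """Return the Salary imputed for the incomplete row using k neighbors."""
    imputer = KNNImputer(n_neighbors=k)
    filled = pd.DataFrame(imputer.fit_transform(frame), columns=frame.columns)
    return filled.loc[2, "Salary"]

# One neighbor copies the closest row; two neighbors average the closest two
one_neighbor = impute_salary(df, 1)
two_neighbors = impute_salary(df, 2)
print(one_neighbor, two_neighbors)
```

With one neighbor, the salary comes from the row with Age 20. With two neighbors, it is the average of the salaries in the rows with Age 20 and Age 50.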
Data is corrupted when it is entered incorrectly into a dataset, for example, when an Age feature holds values outside the plausible human range or when categorical data is entered for a numerical feature. To handle corrupted data, first detect which values are corrupt, then apply methods similar to those described above.
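One possible way to detect such values is to convert them to NaN so that the missing-data methods apply. The sketch below is an illustration under stated assumptions: the records are made up, and the 0 to 120 age range (MIN_AGE, MAX_AGE) is an assumed plausibility limit, not a rule taken from any dataset.

```python
import pandas as pd

# Hypothetical records: "abc" is text in a numeric column, 250 is out of range
records = {
    "Names": ["Samia", "Sana", "Ali", "Omar"],
    "Age": [10, "abc", 250, 30],
}
df = pd.DataFrame(records)

# Entries that cannot be parsed as numbers become NaN instead of raising
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Assumed plausible range for a human age; values outside it become NaN
MIN_AGE, MAX_AGE = 0, 120
df["Age"] = df["Age"].where(df["Age"].between(MIN_AGE, MAX_AGE))

print(df)
print(df["Age"].isna().sum(), "corrupt values flagged as missing")
```

Once the corrupt entries are marked as NaN, they can be dropped or imputed like any other missing value.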