In this shot, we’ll learn about the missing data values in the datasets. The real-world data is never clean and contains many missing values. This usually happens due to many reasons.
Some of the common reasons could be:
Hence, there is a need to handle the missing values while building machine learning models or performing data analytics.
We’ll discuss two approaches to handle the missing values:
In this approach, we delete the entire record or row with missing values. Let’s understand this with an example:
import pandas as pddataset = pd.read_csv('data.csv')print('Data with missing values')print(dataset)print('\nData without missing values')dataset.dropna(inplace = True)print(dataset)
Line 1: we import the required package.
Line 3: we read the dataset using the read_csv()
method.
Line 6: We print the dataset. It can be observed in the output that few records contain NaN
values, that is, missing values.
Line 9: We use the dropna()
method on the dataset. This method deletes the records which have missing values. We also pass the parameter inplace
to be True
so that the records are deleted and the new dataset is updated in the same variable.
Line 10: We print the dataset. Now, no records have missing values.
Let’s now explore the second way to handle the missing values:
In this approach, we replace the missing values with a default value. In this case, no records are deleted. Let’s understand this with an example:
import pandas as pddataset = pd.read_csv("data.csv")print("Data with missing values")print(dataset)print("\nData with default values")dataset.Salary.fillna("45000", inplace = True)dataset.Company.fillna("Google", inplace = True)dataset.Country.fillna("Australia", inplace = True)print(dataset)
The code is almost the same as above, with just one difference.
fillna()
method to fill the missing values with a default value.Here, we have filled a value of 45000
, Google
, and Australia
for the columns Salary
, Company
, and Country
, respectively.
Also, a parameter passed as inplace
to be True
so that the operation is performed in the same dataset and the result is stored back in the same variable.
fillna()
method.