How to handle missing values in Pandas DataFrame

Overview

In this shot, we’ll learn about the missing data values in the datasets. The real-world data is never clean and contains many missing values. This usually happens due to many reasons.

Reasons for missing values

Some of the common reasons could be:

  • Human errors while joining multiple tables.
  • We could skip some data values.
  • The data generating source sometimes behave abnormally.

Hence, there is a need to handle the missing values while building machine learning models or performing data analytics.

Different approach

We’ll discuss two approaches to handle the missing values:

Delete the records having missing data

In this approach, we delete the entire record or row with missing values. Let’s understand this with an example:

main.py
data.csv
import pandas as pd
dataset = pd.read_csv('data.csv')
print('Data with missing values')
print(dataset)
print('\nData without missing values')
dataset.dropna(inplace = True)
print(dataset)

Explanation

  • Line 1: we import the required package.

  • Line 3: we read the dataset using the read_csv() method.

  • Line 6: We print the dataset. It can be observed in the output that few records contain NaN values, that is, missing values.

  • Line 9: We use the dropna() method on the dataset. This method deletes the records which have missing values. We also pass the parameter inplace to be True so that the records are deleted and the new dataset is updated in the same variable.

  • Line 10: We print the dataset. Now, no records have missing values.

Let’s now explore the second way to handle the missing values:

Replace the missing values with a default value

In this approach, we replace the missing values with a default value. In this case, no records are deleted. Let’s understand this with an example:

main.py
data.csv
import pandas as pd
dataset = pd.read_csv("data.csv")
print("Data with missing values")
print(dataset)
print("\nData with default values")
dataset.Salary.fillna("45000", inplace = True)
dataset.Company.fillna("Google", inplace = True)
dataset.Country.fillna("Australia", inplace = True)
print(dataset)

Explanation

The code is almost the same as above, with just one difference.

  • Lines 10 to 12: We use the fillna() method to fill the missing values with a default value.

Here, we have filled a value of 45000, Google, and Australia for the columns Salary, Company, and Country, respectively.

Also, a parameter passed as inplace to be True so that the operation is performed in the same dataset and the result is stored back in the same variable.

  • Line 14: We can observe that the missing values are now filled by the specified values using the fillna() method.