Data Cleaning with pandas

Learn how to clean data and handle missing values with pandas.

Identify null values

Real-world data isn’t always clean and tidy. It’s often messy, with unknown values, missing values, and records that don’t always make sense. There are several reasons why the data could have missing values, some of which include:

  • A possible case of human error where someone failed to input a value.
  • Data may be lost while being transferred from a database resource.
  • A programming error that caused values in specific columns to be skipped.

When we analyze or visualize data, missing or unknown values can blur the picture and make patterns hard to discern. Therefore, routinely checking data for irregularities, such as unknown or missing values, is crucial.

Let’s import a dataset using pandas and see if there are any missing values in it. Keep in mind that we are only using a small dataset here—in reality, data analysts work with much larger datasets and spend a significant amount of time cleaning data. Sometimes, the data is just messy and complicated.

So let’s practice!

import pandas as pd
students_df = pd.read_csv('StudentData.csv')
print(students_df.head())
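
If StudentData.csv isn't at hand, the same loading step can be tried on inline text. The snippet below is only a sketch: the column names and values are invented for illustration and are not taken from the real file. It shows that, with default settings, pandas reads empty CSV fields as NaN.

```python
import io

import pandas as pd

# Hypothetical stand-in for StudentData.csv; these columns and
# values are made up for illustration only.
csv_text = """name,age,grade
Ana,20,A
Ben,,B
Cara,22,
"""

# read_csv accepts a file-like object, so StringIO lets us load
# the text above without writing a file to disk.
sample_df = pd.read_csv(io.StringIO(csv_text))

# The empty age and grade fields appear as NaN in the printout.
print(sample_df)
```

Note that the age column is loaded as floating point here, because NaN is a float value and the column contains one.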

We use the isnull() method to check for missing values. It returns a DataFrame of the same shape in which every null value is replaced with True and every other value with False. This gives us an overview of null values in the DataFrame. However, scanning that output cell by cell takes a long time, especially on larger datasets.

print(students_df.isnull())
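
To make the True/False output concrete, here is a minimal sketch on a small hypothetical DataFrame (the column names and values are invented, not taken from StudentData.csv). Because True counts as 1 when summed, chaining sum() onto the result is one possible way to count nulls per column instead of reading every cell.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration; not the contents of StudentData.csv.
sample_df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara"],
    "age": [20, np.nan, 22],
    "grade": ["A", "B", None],
})

# Boolean DataFrame of the same shape: True marks a null value.
null_mask = sample_df.isnull()
print(null_mask)

# Summing the Boolean mask column by column counts the nulls in each column.
null_counts = null_mask.sum()
print(null_counts)
```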

A ...