Missing Data Detection and Calculations

Understand how to detect missing data and perform calculations involving them.

Detecting missing data

Before we can manage the missing (or null) values in our data, we first need to detect them accurately. pandas provides several methods and functions for this.

In the previous lesson, we learned that NaN isn't considered equal to any value, including itself. This means that comparing the values in a Series or DataFrame object with np.nan (e.g., using operators like == or >=) won't find missing data, because every such comparison evaluates to False.
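To make this concrete, here is a minimal sketch of the comparison failing. The Series name ages and its values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value at position 2
ages = pd.Series([30, 45, np.nan, 50])

# Comparing with np.nan yields False everywhere, even at the NaN entry
eq_mask = ages == np.nan
print(eq_mask.any())  # False

# isnull() flags the missing entry correctly
null_mask = ages.isnull()
print(null_mask.tolist())  # [False, False, True, False]
```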

Instead, we should use isnull() and notnull(), the pandas functions that detect missing values across the different array data types.

Note: Both isnull() and notnull() are described as functions in pandas, though they can also be used as methods with pandas objects such as a Series or DataFrame (e.g., df.isnull()).
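As a small illustration of the note above, the function form and the method form should produce the same mask. The DataFrame named sample is a hypothetical two-row example, not the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one missing weight
sample = pd.DataFrame({"weight_kg": [70, np.nan], "height_cm": [170, 165]})

# Function form: pass the object to pd.isnull()
as_function = pd.isnull(sample)

# Method form: call isnull() on the object itself
as_method = sample.isnull()

# Both forms return identical boolean masks
print(as_function.equals(as_method))  # True
```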

Suppose we have a mock dataset of patient information, as shown below:

Patient Information Dataset with Missing Data

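| patient_id | Age | Gender | weight_kg | height_cm | cholesterol_mgdl |
|------------|-----|--------|-----------|-----------|------------------|
| 123        | 30  | M      | 70        | 170       | 200              |
| 456        | 45  | M      | NaN       | 165       | 220              |
| 789        | NaN | F      | 60        | NaN       | 185              |
| 321        | 50  | NaN    | 80        | 180       | NaN              |
| 654        | 37  | M      | 75        | 175       | NaN              |
| 987        | 77  | M      | 55        | 160       | 195              |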
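To follow along locally, the dataset above can be rebuilt as a DataFrame. This is only a sketch: the variable name df matches the snippets in this lesson, but representing every missing cell (including Gender) with np.nan is an assumption about how the original data was loaded.

```python
import numpy as np
import pandas as pd

# Rebuild the patient table; np.nan marks each missing cell (an assumption)
df = pd.DataFrame({
    "patient_id": [123, 456, 789, 321, 654, 987],
    "Age": [30, 45, np.nan, 50, 37, 77],
    "Gender": ["M", "M", "F", np.nan, "M", "M"],
    "weight_kg": [70, np.nan, 60, 80, 75, 55],
    "height_cm": [170, 165, np.nan, 180, 175, 160],
    "cholesterol_mgdl": [200, 220, 185, np.nan, np.nan, 195],
})

print(df.shape)  # (6, 6)
```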

We can use isnull() to check whether the DataFrame contains missing values, as shown below:

import pandas as pd
# Using isnull() as a function to check for missing/null values in df
# (df is the patient DataFrame shown above)
output = pd.isnull(df)
# View output
print(output)

We can see from the output that isnull() returns a boolean mask of the same shape as the DataFrame, where True indicates a missing value and False indicates a non-missing value. This helps us pinpoint the locations of the cells with missing data and serves as the base for subsequent processing.
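One example of such subsequent processing, offered as a sketch rather than as this lesson's prescribed next step, is counting the missing values in each column. Because True counts as 1 when summed, calling sum() on the mask gives per-column totals. The DataFrame is rebuilt here so the snippet runs on its own, assuming np.nan marks every missing cell.

```python
import numpy as np
import pandas as pd

# Patient dataset from the table above (np.nan for missing cells is an assumption)
df = pd.DataFrame({
    "patient_id": [123, 456, 789, 321, 654, 987],
    "Age": [30, 45, np.nan, 50, 37, 77],
    "Gender": ["M", "M", "F", np.nan, "M", "M"],
    "weight_kg": [70, np.nan, 60, 80, 75, 55],
    "height_cm": [170, 165, np.nan, 180, 175, 160],
    "cholesterol_mgdl": [200, 220, 185, np.nan, np.nan, 195],
})

# Summing the boolean mask counts the True (missing) cells in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells in the frame
total_missing = int(missing_per_column.sum())
print(total_missing)  # 6
```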

On the other hand, notnull() returns the opposite mask, where True indicates a non-missing value, as shown below:

import pandas as pd
# Using notnull() as a function to check for non-missing values in df
# (df is the patient DataFrame shown above)
output = pd.notnull(df)
# View output
print(output)
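The relationship between the two masks can be checked directly: negating the isnull() mask with the ~ operator should reproduce the notnull() mask. The frame named sample below is a hypothetical two-row example used only to keep the check short.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both columns of the second row
sample = pd.DataFrame({"Age": [30, np.nan], "Gender": ["M", np.nan]})

not_null = pd.notnull(sample)
inverted_null = ~pd.isnull(sample)  # element-wise logical NOT of the isnull() mask

# The two masks match cell for cell
print(not_null.equals(inverted_null))  # True
```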

If we want to select all rows that contain at least one null value, we can ...
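A common pandas idiom for this kind of row selection (an assumption here, and not necessarily the approach this lesson goes on to use) is to reduce the isnull() mask row-wise with any(axis=1) and use the result as a boolean index. The DataFrame is rebuilt so the sketch is self-contained, again assuming np.nan marks every missing cell.

```python
import numpy as np
import pandas as pd

# Patient dataset from the table above (np.nan for missing cells is an assumption)
df = pd.DataFrame({
    "patient_id": [123, 456, 789, 321, 654, 987],
    "Age": [30, 45, np.nan, 50, 37, 77],
    "Gender": ["M", "M", "F", np.nan, "M", "M"],
    "weight_kg": [70, np.nan, 60, 80, 75, 55],
    "height_cm": [170, 165, np.nan, 180, 175, 160],
    "cholesterol_mgdl": [200, 220, 185, np.nan, np.nan, 195],
})

# any(axis=1) is True for a row when at least one of its cells is missing
row_has_null = df.isnull().any(axis=1)

# Boolean indexing keeps only the rows flagged above
rows_with_null = df[row_has_null]
print(rows_with_null["patient_id"].tolist())  # [456, 789, 321, 654]
```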
