Filtering

Learn how to use where() and mask() for filtering and replacing data based on boolean indexing.

Recap of boolean indexing

Before we dive into filtering numerical values with the pandas methods where() and mask(), it helps to revisit boolean indexing. Boolean indexing selects data from a DataFrame using an array of True/False values: only the elements whose corresponding value in that array is True are selected.

This array of True/False values is known as a boolean mask and has the same shape as the original data. The True or False values in the boolean mask are determined by the criteria we define. For example, take the following subset of the credit card dataset and apply the condition that numerical values must be less than 40:

# Display original DataFrame
print('Original DataFrame')
print(df)
print('=' * 50)
# Get boolean mask (an element-wise comparison returns a DataFrame of True/False values)
bool_mask = df < 40
print('Boolean Mask')
print(bool_mask)
print('=' * 50)
# Apply condition of numerical values < 40 on entire DataFrame
output = df[df < 40]
print('Filtered DataFrame')
print(output)

Besides the original DataFrame, the output above displays two results:

  • A boolean mask with the same shape as the original DataFrame.

  • A filtered DataFrame where numerical values that meet the criteria (i.e., have a True value in the corresponding boolean mask) are retained, while the elements that don’t meet the criteria are replaced with NaN values.
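The two results above can be reproduced on a small standalone DataFrame. This is a minimal sketch: toy_df and its column names and values are made up for illustration and are not taken from the credit card dataset.

```python
import pandas as pd

# Hypothetical stand-in data (not the credit card dataset used above)
toy_df = pd.DataFrame({
    'age': [35, 52, 41],
    'months_on_book': [39, 44, 21],
})

# Element-wise comparison returns a boolean mask with the same shape as toy_df
toy_mask = toy_df < 40

# Indexing with the mask keeps values where the mask is True and puts NaN elsewhere
toy_filtered = toy_df[toy_mask]

print(toy_mask)
print(toy_filtered)
```

Note that the integer columns come back as floats in the filtered result, because NaN can't be stored in a standard integer column.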

Filtering with where()

Building on the concept of boolean indexing, the where() method allows us to ...