Filtering Data

This lesson focuses on how to filter data with Pandas.

Filtering

Filtering is the process of extracting a subset of your data based on some condition or constraint, usually one placed on the values that the data items take. We filter data when we want to look at a smaller part of the whole dataset. For instance, we may want:

  • the data in a particular period of the year
  • the data of the highest selling items
  • the data for a specific group of items
  • to remove extra or useless data
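As a rough sketch of what such conditions can look like in Pandas, the snippet below filters a tiny table with boolean masks. The `sales` DataFrame and its columns (`item`, `category`, `month`, `units_sold`) are made up for illustration and are not part of the housing dataset used in this lesson.

```python
import pandas as pd

# Hypothetical sales records, invented purely for illustration
sales = pd.DataFrame({
    'item': ['pen', 'book', 'lamp', 'desk', 'mug'],
    'category': ['stationery', 'stationery', 'home', 'home', 'kitchen'],
    'month': [1, 3, 3, 7, 12],
    'units_sold': [120, 45, 30, 8, 75],
})

# A comparison on a column yields a boolean Series (a mask), one entry per row;
# indexing the DataFrame with that mask keeps only the rows where it is True
march_sales = sales[sales['month'] == 3]
print(march_sales)

# nlargest() returns the rows with the highest values in a column
top_sellers = sales.nlargest(2, 'units_sold')
print(top_sellers)

# isin() keeps rows whose value belongs to a given group
home_or_kitchen = sales[sales['category'].isin(['home', 'kitchen'])]
print(home_or_kitchen)

# Negating a mask with ~ drops the matching rows instead of keeping them
without_slow_items = sales[~(sales['units_sold'] < 10)]
print(without_slow_items)
```

Each result is a new DataFrame; the original `sales` table is left unchanged.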

Data filtering is done on almost every dataset before doing any analysis. Let’s look at some examples using our California Housing Dataset.

import pandas as pd
df = pd.read_csv('housing.csv')
print(df.head())

Let’s say we want to see the data for all the housing blocks that are close to the ocean. The output of the code block above shows that the ocean_proximity column contains text rather than numbers. We will first find out which distinct values appear in this column and then decide how to filter rows for our requirement.

import pandas as pd
df = pd.read_csv('housing.csv')

# Find all distinct values in ocean_proximity
unique_values = df['ocean_proximity'].unique()
print(unique_values)

We called unique() on the ocean_proximity column to obtain all the unique values in this ...
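Once the distinct values are known, one possible way to filter on them is an equality mask. The sketch below uses a small in-memory stand-in for housing.csv so that it runs without the file; the category labels ('NEAR BAY', 'INLAND', 'NEAR OCEAN') and the numbers are assumptions chosen for illustration and may not match the real file exactly.

```python
import pandas as pd

# Small in-memory stand-in for housing.csv; the labels and numbers below
# are assumptions for illustration, not values read from the real file
sample = pd.DataFrame({
    'median_house_value': [452600, 120000, 310000, 98000, 275000],
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR OCEAN', 'INLAND', 'NEAR OCEAN'],
})

# nunique() counts the distinct values; unique() lists them
print(sample['ocean_proximity'].nunique())
print(sample['ocean_proximity'].unique())

# Keep only the rows whose label equals the one we care about
near_ocean = sample[sample['ocean_proximity'] == 'NEAR OCEAN']
print(near_ocean)
```

With the real dataset, the same pattern would apply to `df`, using whichever label printed by unique() corresponds to blocks close to the ocean.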