The groupby Function
Learn how to use Pandas groupby function for data analysis.
We'll cover the following
The lessons up to this point have covered data cleaning, manipulation, and processing with Pandas. Pandas is a great library for data analysis as well. In this chapter, we’ll go over Pandas functions that can be used to analyze tabular data.
Data analysis
Data analysis can be defined as the process of inferring insights, discovering useful information, and drawing results from the data at hand. It’s mainly done to support a decision-making process or to explore the data before creating a machine learning model.
In this lesson, we’ll use a new dataset called grocery
that contains the daily sales quantities of some products at a grocery store. Let’s take a quick look at this dataset.
import pandas as pdgrocery = pd.read_csv("grocery.csv")print("The size of the DataFrame:")print(grocery.shape)print("\nThe column names are:")print(list(grocery.columns))print("\nThe first five rows:")print(grocery.head())
One of the most commonly used functions in data analysis is the groupby
function. It groups observations (rows) according to the distinct values in a given column. Let’s say we have a DataFrame
that contains the sales information about the products in a retail store. Each product belongs to a product group, which is indicated in the product_group
column. By using the groupby
function, we can group the products based on the product groups they belong to. Then, we can calculate a wide range of aggregations, such as average product price, daily total sales, and so on.
How to use the groupby
function
The only requirement is to pass the name of the column used for grouping to the groupby
function. Then, the rows are grouped based on the distinct values in the given column. The following drawing illustrates how the groupby
function operates.
The groupby
function creates the groups, but they don’t provide any information unless we do some aggregations. For instance, in the illustration above, if we apply the mean
function to the price
column, we can calculate the average product price for each product group. Let’s do this operation on the grocery
.
import pandas as pdgrocery = pd.read_csv("grocery.csv")print(grocery.groupby("product_group").mean())
As we see in the output above, once the groups are formed and the mean function is applied, Pandas calculates the mean
value for all the numerical columns. So, we’re able to see the average sales quantities as well. Average product code is meaningless because the product code is just used as an identity.
If we’re only interested in the average price, we can select the columns before applying the groupby
function. For instance, in line 5 of the following block of code, we first select the product_group
and price
columns from the grocery
and then group the rows by the product_column
. Finally, the mean
function is applied to see the average price for each product group.
import pandas as pdgrocery = pd.read_csv("grocery.csv")print(grocery[["product_group","price"]].groupby("product_group").mean())
We now see only the average price of the products in each group.