Data aggregation in R refers to the process of summarizing and combining data into a more concise format.
Key takeaways
Purpose: The summarize() function is used to summarize a DataFrame or vector into a single value.
Syntax: summarize(.data, ...), where ... represents summary functions like mean(), sum(), etc.
Parameters:
.data: DataFrame or tibble to summarize.
...: Name-value pairs for summary calculations.
.by (optional): Directly specify columns to group by within summarize().
.groups (optional): Control the grouping structure in the result.
Variants:
summarize_all(): Applies a function to all columns.
summarize_at(): Applies a function to specific columns.
summarize_if(): Applies a function based on a condition.
Usage:
Ungrouped Data: Use summarize() directly to get overall summary statistics.
Grouped Data: Use summarize() with .by or group_by() to summarize data within groups.
The dplyr package by Wickham et al. is an R package for solving data manipulation challenges. One of these challenges is generating summary statistics that extract meaningful insights from large datasets. The dplyr package provides the summarize() function, which computes summaries on grouped or ungrouped data.
The summarize() function
The dplyr summarize() function in R reduces a DataFrame to a single row of summary values (or one row per group). For example, if we want to calculate the average of a column, it returns only one row containing the mean of that column.
The goal is to turn data into information, and information into insight.
—Carly Fiorina, CEO of HP (1999–2005)
The summarize() method fits perfectly with this goal, helping us turn raw data into meaningful information that can then be used to get in-depth insights into our research or analytical queries.
Syntax of the summarize() function
The following syntax is used for the summarize() function:
summarize(.data, ..., .by = NULL, .groups = NULL)
# OR
summarise(.data, ..., .by = NULL, .groups = NULL)
Both summarize() and summarise() in R are equivalent.
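As a quick check of the equivalence described above, the following sketch runs the same summary with both spellings on the built-in PlantGrowth dataset and compares the results. The variable names us_spelling, uk_spelling, and same_result are illustrative only.

```r
library(dplyr, warn.conflicts = FALSE)

# Compute the same summary with both spellings
us_spelling <- summarize(PlantGrowth, mean_weight = mean(weight))
uk_spelling <- summarise(PlantGrowth, mean_weight = mean(weight))

# Compare the two one-row results
same_result <- identical(us_spelling, uk_spelling)
print(same_result)
```

Either spelling can be used; it is usually best to pick one and stay consistent within a project.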
Parameters of the summarize() function
.data: It can be a DataFrame or tibble.
...: Name-value pairs that allow us to apply one or more summary calculations. The name is the name of the output variable that holds the summary, and the value is a function call or expression that calculates a summary of the given data. The following table lists some useful functions:
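Types of summary functions | Functions |
Center | mean(), median() |
Position | first(), last(), nth() |
Range | min(), max() |
Logical | any(), all() |
Count | n(), n_distinct() |
Spread | sd(), IQR(), mad() |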
.by (optional): Specifies columns to group by directly within summarize(), eliminating the need for a separate group_by() function.
.groups (optional): Specifies how to handle grouping after summarizing. It can be set to control how many grouping levels are retained in the output. The options are:
"drop_last": Drops the last level of grouping.
"drop": Drops all grouping levels.
"keep": Retains the same grouping structure as .data.
"rowwise": Each row forms its own group.
Return value: summarize() returns a DataFrame with one row per group, or a single row if no grouping is applied. Each summary expression should return a single value (a scalar) computed from the values in each group or from the entire dataset.
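To make the .groups options more concrete, here is a minimal sketch that groups the built-in mtcars dataset by two columns and inspects which grouping variables remain after summarizing. The choice of mtcars, the columns cyl and gear, and the variable names are assumptions made for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Group by two columns so there is more than one grouping level
by_cyl_gear <- group_by(mtcars, cyl, gear)

# "drop_last" removes only the last grouping level (gear)
drop_last_result <- summarize(by_cyl_gear, mean_mpg = mean(mpg), .groups = "drop_last")

# "drop" removes every grouping level
drop_result <- summarize(by_cyl_gear, mean_mpg = mean(mpg), .groups = "drop")

# "keep" retains the grouping structure of the input
keep_result <- summarize(by_cyl_gear, mean_mpg = mean(mpg), .groups = "keep")

# Inspect the grouping variables that remain in each result
group_vars(drop_last_result)
group_vars(drop_result)
group_vars(keep_result)
```

The summary values are the same in all three results; only the grouping metadata attached to the output differs, which matters when further grouped operations follow.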
Example of the summarize() function
Let's look at a basic example to understand the summarize() function. Here we use the PlantGrowth dataset in R, which contains two columns: weight and group. The weight column is numeric, and we want to calculate the overall average weight and maximum weight using the summarize() function.
# Load library
library(dplyr, warn.conflicts = FALSE)

data <- PlantGrowth

# Applying summarize()
summarize(data, meanweight = mean(weight))

# Multiple statistics
summarize(data, meanweight = mean(weight), maxweight = max(weight))
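Beyond mean() and max(), several other summary helpers can be combined as name-value pairs in one call. The sketch below is one possible combination on PlantGrowth; the output column names (n_plants, n_groups, and so on) are illustrative, not prescribed by dplyr.

```r
library(dplyr, warn.conflicts = FALSE)

# Each name-value pair becomes one column in the single-row result
overall <- summarize(
  PlantGrowth,
  n_plants      = n(),               # number of rows
  n_groups      = n_distinct(group), # number of distinct treatment groups
  median_weight = median(weight),    # center
  sd_weight     = sd(weight),        # spread
  min_weight    = min(weight)        # range
)

print(overall)
```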
summarize() based on data
There are two types of data to which we can apply summarize(): grouped data and ungrouped data.
One of the most useful aspects of summarize() is its ability to work with grouped data. By initially grouping the data, we can compute summary statistics for each group individually.
The operations that can be performed on grouped data include mean, sum, count, and so on.
# Load library
library(dplyr, warn.conflicts = FALSE)

data <- PlantGrowth

# Return unique values of the group column
unique(data$group)

summarize(data, meanweight = mean(weight), .by = group)
In the example above, we use the summarize() function with .by = group to obtain the mean weight for each treatment group (ctrl, trt1, and trt2) in the PlantGrowth dataset.
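The .by parameter also works when several summaries are computed at once. The following sketch assumes dplyr 1.1.0 or later (the release that introduced .by) and uses illustrative column names; it also checks that the result carries no grouping metadata.

```r
library(dplyr, warn.conflicts = FALSE)

# Several summaries per treatment group, grouped inline with .by
per_group <- summarize(
  PlantGrowth,
  n_plants    = n(),
  mean_weight = mean(weight),
  .by = group
)

print(per_group)

# .by groups only for this one call, so the result is ungrouped
is_grouped_df(per_group)
```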
Using group_by() to group data
group_by() can be used as an alternative to the .by parameter. The following example demonstrates the usage of the group_by() function with the summarize() method.
Please note that the .groups parameter is used to manage the grouping structure of the result, but it cannot be used together with the .by parameter. We can, however, use the .groups parameter with group_by(), because group_by() first groups the data and summarize() is then applied to the grouped result.
# Load library
library(dplyr, warn.conflicts = FALSE)

data <- PlantGrowth

unique(data$group)

result <- data %>%
  group_by(group) %>%
  summarise(
    mean_weight = mean(weight),
    .groups = "drop" # Control the grouping structure in the result
  )

print(result)
%>% is the pipe operator that passes the result of the left-hand side to the function on the right-hand side.
Line 9: Groups the data by the group column using the group_by() function. Each group (e.g., ctrl, trt1, trt2) is treated separately for the summarization.
Line 11: Calculates the mean weight for each group.
Line 12: Drops the grouping structure from the final result. After summarization, the result is a simple data frame without any grouping metadata.
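To see how the two grouping approaches relate, the sketch below computes the same per-group mean with group_by() plus .groups = "drop" and with .by, then compares the results. It assumes dplyr 1.1.0 or later for .by; the variable names are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

# Approach 1: explicit grouping, then drop the groups after summarizing
via_group_by <- PlantGrowth %>%
  group_by(group) %>%
  summarize(mean_weight = mean(weight), .groups = "drop")

# Approach 2: inline grouping with .by
via_by <- summarize(PlantGrowth, mean_weight = mean(weight), .by = group)

# Neither result carries grouping metadata
is_grouped_df(via_group_by)
is_grouped_df(via_by)

# The per-group means agree
same_means <- isTRUE(all.equal(via_group_by$mean_weight, via_by$mean_weight))
print(same_means)
```

One difference to keep in mind: group_by() sorts the output by the group keys, while .by keeps groups in their order of first appearance. In PlantGrowth the two orders happen to coincide.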
To summarize ungrouped data, we can simply use summarize() without .by or group_by(). In addition, dplyr offers variations of summarize() that handle multiple columns efficiently, especially when working with ungrouped data. These variations are listed below:
summarize_all()
summarize_at()
summarize_if()
summarize_all() method
This function summarizes all the columns of the data by applying the given action to each of them.
summarize_all(.data, action)
action: The function to apply to the DataFrame columns. It can be a function name, a lambda (for example, ~ mean(.x)), or a list of functions. The older funs() helper is deprecated.
In the code snippet below, we load mtcars (the Motor Trend US magazine dataset) into the data variable. The sample variable holds the first six observations. The expression sample %>% summarize_all(mean) then shows the mean of each column across those six observations.
# Load dplyr library
library(dplyr, warn.conflicts = FALSE)

# Main code
data <- mtcars

# Load the first 6 observations
sample <- head(data)

# Calculate the mean of every column
sample %>% summarize_all(mean)
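In dplyr 1.0.0 and later, the scoped variants such as summarize_all() are superseded by across(), although they remain available. As a rough sketch of the newer style, the following compares the two forms on the same six rows; the variable names are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

sample_rows <- head(mtcars)

# Scoped variant: apply mean() to every column
scoped_result <- summarize_all(sample_rows, mean)

# across() form: select every column, then apply mean()
across_result <- summarize(sample_rows, across(everything(), mean))

# Compare the column names and values of the two results
same_names  <- identical(names(scoped_result), names(across_result))
same_values <- isTRUE(all.equal(unlist(scoped_result), unlist(across_result)))
print(c(same_names, same_values))
```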
summarize_at() method
It performs the action on the specified columns and generates the summary based on that action.
summarize_at(.data, vector_of_columns, action)
vector_of_columns: A character vector of column names.
action: The function to apply to the selected columns. It can be a function name, a lambda, or a list of functions. The older funs() helper is deprecated.
In the code snippet below, we load mtcars into the data variable. The sample variable holds the first six observations. The expression sample %>% group_by(hp) %>% summarize_at(c('cyl','mpg'), mean) shows the mean of the 'cyl' and 'mpg' columns for each value of hp (a column of the dataset).
# Load dplyr library
library(dplyr, warn.conflicts = FALSE)

# Main code
data <- mtcars
sample <- head(data)

sample %>%
  group_by(hp) %>%
  summarize_at(c('cyl', 'mpg'), mean)
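summarize_at() can also apply several functions at once when the action is a named list. The sketch below is an illustration on the full mtcars dataset grouped by cyl; the list names avg and top are arbitrary labels chosen here, and they become suffixes of the output column names.

```r
library(dplyr, warn.conflicts = FALSE)

# Apply two functions to two columns within each cyl group
result <- mtcars %>%
  group_by(cyl) %>%
  summarize_at(c("mpg", "hp"), list(avg = mean, top = max))

# Output columns combine the column name and the list name
names(result)
print(result)
```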
summarize_if() method
In this function, we specify a condition, and the summary is generated only for the columns that satisfy it.
summarize_if(.data, predicate, action)
predicate: A predicate function applied to each DataFrame column, or a logical vector, that selects the columns to summarize.
action: The function to apply to the selected columns. It can be a function name, a lambda, or a list of functions. The older funs() helper is deprecated.
Note: A predicate function in R returns only TRUE or FALSE.
In the code snippet below, we use the predicate function is.numeric and mean as the action.
# Load dplyr library
library(dplyr, warn.conflicts = FALSE)

# Main code
data <- mtcars
z <- head(data)

z %>%
  group_by(hp) %>%
  summarize_if(is.numeric, mean)
Here, is.numeric checks if a column is numeric. If yes, then it calculates the mean of each numeric column within each group defined by hp.
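Because every column in mtcars is numeric, the effect of the predicate is easier to see on a dataset with mixed column types. The following sketch uses the built-in iris dataset, which has four numeric columns and one factor column (Species); the dataset choice and variable name are for illustration only.

```r
library(dplyr, warn.conflicts = FALSE)

# Only columns for which is.numeric() returns TRUE are summarized
numeric_means <- summarize_if(iris, is.numeric, mean)

# The factor column Species is skipped
names(numeric_means)
print(numeric_means)
```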
Let’s quickly assess your understanding of the summarize() method by trying the following quiz:
How do we specify columns to group by when using summarize()?
A. Using group_by()
B. Using .by parameter
C. Using .data parameter
D. Both A and B
Finally, dplyr's summarize() method is an essential tool for data aggregation and analysis in R. It allows us to effectively aggregate data, resulting in clear and actionable insights.