What is the summarize() method in R?

Key takeaways

  • Purpose: The summarize() function collapses a DataFrame or tibble into summary values: one row for ungrouped data, or one row per group for grouped data.

  • Syntax: summarize(.data, ...) where ... represents summary functions like mean(), sum(), etc.

  • Parameters:

    • .data: DataFrame or tibble to summarize.

    • ...: Name-value pairs for summary calculations.

    • .by (optional): Directly specify columns to group by within summarize().

    • .groups (optional): Control the grouping structure in the result.

  • Variants:

    • summarize_all(): Applies a function to all columns.

    • summarize_at(): Applies a function to specific columns.

    • summarize_if(): Applies a function based on a condition.

  • Usage:

    • Ungrouped Data: Use summarize() directly to get overall summary statistics.

    • Grouped Data: Use summarize() with .by or group_by() to summarize data within groups.

The dplyr package by Wickham et al. is an R package for solving data manipulation challenges. One of these challenges is computing summary statistics that extract meaningful insights from large datasets. For this, dplyr provides the summarize() function, which computes summaries of grouped or ungrouped data.

The summarize() function

The dplyr summarize() function in R reduces a DataFrame to a single row of summary values (or to one row per group for grouped data). For example, if we calculate the average of a column, it returns a single row containing the mean of that column.

The goal is to turn data into information, and information into insight.

—Carly Fiorina, CEO of HP (1999–2005)

The summarize() method fits perfectly with this goal, helping us turn raw data into meaningful information that can then be used to get in-depth insights into our research or analytical queries.

Syntax for the summarize() function

The following syntax is used for the summarize() function:

summarize(.data, ..., .by = NULL, .groups = NULL)
#OR
summarise(.data, ..., .by = NULL, .groups = NULL)

Both summarize() and summarise() in R are equivalent.

Parameters for the summarize() function

  • .data: It can be a DataFrame or tibble.

  • ...: This represents name-value pairs that allow us to apply one or more summary calculations. The name is the name of an output variable containing the summary, and the value is a function call or expression that calculates a summary of the given data. The following table lists some useful functions:

Useful functions

Types of summary functions    Functions
Center                        mean(), median()
Position                      first(), last(), nth()
Range                         min(), max()
Logical                       any(), all()
Count                         n(), n_distinct()
Spread                        sd(), IQR(), mad()

  • .by (optional): Specifies columns to group by directly within summarize(), eliminating the need for a separate group_by() function.

  • .groups (optional): Specifies how to handle grouping after summarizing. It can be set to control how many grouping levels are retained in the output. The options are:

    • "drop_last": Drops the last level of grouping.

    • "drop": Drops all grouping levels.

    • "keep": Retains the same grouping structure as .data.

    • "rowwise": Each row forms its own group.

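Return values: Each summary expression must return a single value (a scalar) for each group, or for the entire dataset if no grouping is applied. The result of summarize() is a DataFrame with one row per group and one column per summary, plus the grouping columns.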
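The .groups options listed above only differ visibly when the data is grouped by more than one column. The following is a minimal sketch, assuming the built-in mtcars dataset grouped by cyl and gear (a choice made here for illustration, not taken from the examples in this article). It uses dplyr's group_vars() to inspect which grouping columns remain after summarizing.

```r
library(dplyr, warn.conflicts = FALSE)

# Group by two columns so that the .groups options give different results
grouped <- mtcars %>% group_by(cyl, gear)

drop_last <- grouped %>% summarize(mean_mpg = mean(mpg), .groups = "drop_last")
dropped   <- grouped %>% summarize(mean_mpg = mean(mpg), .groups = "drop")
kept      <- grouped %>% summarize(mean_mpg = mean(mpg), .groups = "keep")
by_row    <- grouped %>% summarize(mean_mpg = mean(mpg), .groups = "rowwise")

# Inspect the grouping that remains in each result
group_vars(drop_last)           # "cyl"
group_vars(dropped)             # character(0)
group_vars(kept)                # "cyl" "gear"
inherits(by_row, "rowwise_df")  # TRUE
```

All four results contain the same rows and values; only the grouping metadata attached to the result differs.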

Code example for the summarize() function

Let's look at a basic example to understand the summarize() function. Here we use the PlantGrowth dataset in R, which contains two columns: weight and group. The weight column is numeric, and we want to calculate the overall average weight and maximum weight using the summarize() function.

# Load library
library(dplyr, warn.conflicts = FALSE)
data <- PlantGrowth
# Applying summarize()
summarize(data, meanweight = mean(weight))
# Multiple statistics
summarize(data, meanweight = mean(weight), maxweight = max(weight))
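Several of the summary functions from the "Useful functions" table can be combined in a single call. The sketch below is one possible illustration on the same PlantGrowth dataset; the output column names (n_rows, n_groups, and so on) and the threshold of 6 are arbitrary choices made for this example.

```r
library(dplyr, warn.conflicts = FALSE)

# One summarize() call, one output column per name-value pair
overall <- summarize(
  PlantGrowth,
  n_rows        = n(),                # Count: number of rows
  n_groups      = n_distinct(group),  # Count: number of distinct treatments
  median_weight = median(weight),     # Center
  sd_weight     = sd(weight),         # Spread
  min_weight    = min(weight),        # Range
  first_weight  = first(weight),      # Position: value in the first row
  any_above_6   = any(weight > 6)     # Logical: is any weight above 6?
)
print(overall)
```

Because the data is ungrouped, the result is a single row with seven columns.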

How to use summarize() based on data

We can apply summarize() to two types of data: grouped data and ungrouped data.

1. Summarize grouped data

One of the most useful aspects of summarize() is its ability to work with grouped data. By initially grouping the data, we can compute summary statistics for each group individually.

Typical per-group summaries include counts, means, medians, and minimum or maximum values.

# Load library
library(dplyr, warn.conflicts = FALSE)
data <- PlantGrowth
# Return the unique values of the group column
unique(data$group)
summarize(data, meanweight = mean(weight), .by = group)

In the example above, we use the summarize() function with .by = group to obtain the mean weight for each treatment group (ctrl, trt1, and trt2) in the PlantGrowth dataset.

Using group_by() to group data

group_by() can be used as an alternative to the .by parameter. The following example demonstrates the usage of the group_by() function with the summarize() function.

Please note that the .groups parameter manages the grouping structure of the result and cannot be used together with the .by parameter, because .by always returns an ungrouped result. With group_by(), the data is grouped first and then summarized, so .groups can be used to control which grouping remains.

# Load library
library(dplyr, warn.conflicts = FALSE)
data <- PlantGrowth
unique(data$group)
result <- data %>%
  group_by(group) %>%
  summarise(
    mean_weight = mean(weight),
    .groups = "drop" # Control the grouping structure in the result
  )
print(result)

%>% is the pipe operator that passes the result of the left-hand side to the function on the right-hand side.

  • Line 6: Groups the data by the group column using the group_by() function. Each group (ctrl, trt1, and trt2) is summarized separately.

  • Line 8: Calculates the mean weight for each group.

  • Line 9: Drops the grouping structure from the final result. After summarization, the result is a plain tibble without any grouping metadata.
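To check that the .by parameter and group_by() lead to the same summary, the two approaches can be compared side by side. This is a small sketch on PlantGrowth; the variable names via_by and via_group_by are hypothetical, and is_grouped_df() is dplyr's helper for testing whether a result still carries groups.

```r
library(dplyr, warn.conflicts = FALSE)

# Approach 1: group inside summarize() with .by
via_by <- summarize(PlantGrowth, mean_weight = mean(weight), .by = group)

# Approach 2: group first, summarize, then drop the grouping
via_group_by <- PlantGrowth %>%
  group_by(group) %>%
  summarize(mean_weight = mean(weight), .groups = "drop")

# Same per-group means either way
all.equal(via_by$mean_weight, via_group_by$mean_weight)

# Neither result carries grouping metadata
is_grouped_df(via_by)
is_grouped_df(via_group_by)
```

One difference to keep in mind: .by keeps groups in their order of first appearance, while group_by() sorts the groups. For PlantGrowth both orders happen to coincide.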

2. Summarize ungrouped data

To summarize ungrouped data, we simply call summarize() without .by or group_by(). dplyr also offers scoped variants of summarize() that apply a function to several columns at once. In current dplyr releases these variants are superseded by across() but remain available. They are listed below:

  • summarize_all()
  • summarize_at()
  • summarize_if()

The summarize_all() method

This function applies the given action to every column of the data and returns the results as a summary.

Syntax

summarize_all(.data, action)

Parameters

action: The function to apply to the DataFrame columns. It can be a function such as mean, a lambda such as ~ mean(.x), or a list of functions. The older funs() helper is deprecated.

Code

In the code snippet below, we load mtcars (the Motor Trend US magazine dataset) in the data variable. In the variable sample, we store the first six observations. The expression sample %>% summarize_all(mean) shows the mean of each column over those six observations.

# Load dplyr library
library(dplyr, warn.conflicts = FALSE)
# Main code
data <- mtcars
# Take the first 6 observations
sample <- head(data)
# Calculate the mean of every column
sample %>% summarize_all(mean)
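The action does not have to be a bare function name. As a hedged illustration of the other forms the action parameter accepts, the sketch below passes a lambda and then a named list of functions to summarize_all(); the list names mean and max are chosen here for the example and become suffixes of the output column names.

```r
library(dplyr, warn.conflicts = FALSE)

sample <- head(mtcars)

# A lambda: gives the same result as passing mean directly
with_lambda <- sample %>% summarize_all(~ mean(.x))

# A named list of functions: one output column per input column and function,
# with names such as mpg_mean and mpg_max
both <- sample %>% summarize_all(list(mean = mean, max = max))
names(both)
```

With a single function the output keeps the original column names; with a named list, each name is appended to the column name so the results can be told apart.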

The summarize_at() method

This function applies the action only to the specified columns and generates the summary for those columns.

Syntax

summarize_at(.data, vector_of_columns, action)

Parameters

  • vector_of_columns: A character vector of column names.
  • action: The function to apply to the DataFrame columns. It can be a function such as mean, a lambda such as ~ mean(.x), or a list of functions. The older funs() helper is deprecated.

Code

In the code snippet below, we load mtcars in the data variable and store its first six observations in the variable sample. The expression sample %>% group_by(hp) %>% summarize_at(c('cyl','mpg'), mean) shows the mean of the cyl and mpg columns for each value of hp (a column of the dataset).

# Load dplyr library
library(dplyr, warn.conflicts = FALSE)
# Main code
data <- mtcars
sample <- head(data)
sample %>%
  group_by(hp) %>%
  summarize_at(c('cyl', 'mpg'), mean)

The summarize_if() method

In this function, we specify a condition (a predicate), and the summary is generated only for the columns that satisfy it.

Syntax

summarize_if(.data, predicate, action)

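Parameters

  • predicate: A predicate function that is applied to each DataFrame column, or a logical vector, to select the columns to summarize.
  • action: The function to apply to the selected columns. It can be a function such as mean, a lambda such as ~ mean(.x), or a list of functions. The older funs() helper is deprecated.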

Note: A predicate function in R returns either TRUE or FALSE.

Code

In the code snippet below, we use the predicate function is.numeric and mean as an action.

# Load dplyr library
library(dplyr, warn.conflicts = FALSE)
# Main code
data <- mtcars
z <- head(data)
z %>%
  group_by(hp) %>%
  summarize_if(is.numeric, mean)

Here, is.numeric checks whether each column is numeric. The mean is then calculated for every numeric column within each group defined by hp.
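In recent dplyr versions, the same three summaries can also be written with summarize() and across(), which supersedes the scoped variants. The following sketch shows one way to express each example from this section; across(), everything(), and where() are dplyr/tidyselect functions that the examples above do not use, and the result variable names are hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

sample <- head(mtcars)

# summarize_all(mean) written with across(): every column
all_cols <- sample %>%
  summarize(across(everything(), mean))

# summarize_at(c('cyl', 'mpg'), mean) written with across(): named columns
some_cols <- sample %>%
  group_by(hp) %>%
  summarize(across(c(cyl, mpg), mean))

# summarize_if(is.numeric, mean) written with across(): columns chosen by a predicate
numeric_cols <- sample %>%
  group_by(hp) %>%
  summarize(across(where(is.numeric), mean))

print(some_cols)
```

The grouping column hp is excluded from the summarized columns in both styles, so the results should match the scoped versions column for column.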

Let’s quickly assess your understanding of the summarize() function by trying the following quiz:

1. How do we specify columns to group by when using summarize()?

A) Using group_by()

B) Using the .by parameter

C) Using the .data parameter

D) Both A and B


In conclusion, dplyr's summarize() function is an essential tool for data aggregation and analysis in R. It allows us to effectively aggregate data, resulting in clear and actionable insights.

Frequently asked questions


What is data aggregation in R?

Data aggregation in R refers to the process of summarizing and combining data into a more concise format.


What are R data summarization techniques?

R offers several summarization tools, including mean(), median(), sd(), sum(), summarize(), aggregate(), and summary().


What is statistical summarization in R?

Statistical summarization in R involves generating summary statistics from data. This includes functions that calculate measures like mean, range, and standard deviation, helping in data analysis.


What is data manipulation with dplyr in R?

dplyr is a package designed for data manipulation. It provides a set of functions that make it easier to work with DataFrames by allowing us to perform operations like selecting, arranging, filtering, and summarizing data.


What does the R function summary() do?

The summary() function summarizes an R object. For a DataFrame, it returns descriptive statistics for each column. For numeric columns, these are the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.
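For comparison with dplyr's summarize(), the base R functions summary() and aggregate() mentioned in these answers can be tried on the same PlantGrowth dataset. This is only a brief sketch, and it needs no additional packages.

```r
# Base R only; no packages required

# summary() on a numeric vector returns six named statistics
weight_summary <- summary(PlantGrowth$weight)
print(weight_summary)
names(weight_summary)

# aggregate() computes a per-group summary, similar to summarize() with .by
group_means <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
print(group_means)
```

The aggregate() call returns one row per treatment group, much like the grouped summarize() examples earlier in this article.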

