Applying Complex Row-by-Row Operations
Learn to perform complex row-by-row operations inside a tibble using rowwise and mutate.
Often in a data science context, we’ll read some data and then need to add new columns based on the existing data. For instance, we might create an additional column that classifies fish as healthy, underweight, or overweight using a formula based on the weight and length values already in the input data. Or, if we’re working with grade data for students, we might need to add columns for each student’s maximum and minimum grades.
However, most summary functions we use with the tidyverse, like cor or mean, are intended to aggregate across rows of data; they’re column-wise aggregations. That idea is consistent with the fact that we’re working with tidy data, so most of our aggregations will run column-wise, across our rows (observations), and not across our columns (variables).
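As a minimal sketch of what column-wise aggregation looks like, the example below summarizes a small, hypothetical grades tibble. The student names, column names, and values are assumptions made up for illustration; they are not part of the lesson's data.

```r
library(dplyr)
library(tibble)

# Hypothetical grade data: one row per student, one column per exam
grades <- tibble(
  student = c("Ana", "Ben", "Cy"),
  exam1 = c(90, 70, 85),
  exam2 = c(80, 75, 95)
)

# Column-wise aggregation: each function collapses an entire column
# (all rows) into a single value
col_summary <- grades %>%
  summarise(
    mean_exam1 = mean(exam1),
    exam_cor = cor(exam1, exam2)
  )

print(col_summary)
```

The result has a single row because every observation was folded into one summary value per column.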
When working with tidy data, row-wise aggregations are most common in the data cleaning stages rather than the actual analytics. The functions of tidyverse are designed to work best when rows represent observations and columns represent variables. So, we’ll typically perform row-wise aggregations to create columns (a.k.a. variables) representing latent measurements—things we indirectly observed based on directly measuring other variables.
When calculating additional columns this way, the tidyverse has two essential functions that provide quite elegant solutions: mutate and rowwise. These two functions allow us to add columns (mutate) to the existing tibble and to do so using aggregations of the values in the same row (rowwise).
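The combination of the two functions can be sketched with the student grades scenario mentioned earlier. The tibble below is hypothetical (names, columns, and values are assumptions for illustration), and c_across is one of several ways to select the columns to aggregate within each row.

```r
library(dplyr)
library(tibble)

# Hypothetical grade data: one row per student, one column per exam
grades <- tibble(
  student = c("Ana", "Ben", "Cy"),
  exam1 = c(90, 70, 85),
  exam2 = c(80, 75, 95),
  exam3 = c(85, 60, 90)
)

grades_by_student <- grades %>%
  rowwise() %>%                                # treat each row as its own group
  mutate(
    max_grade = max(c_across(exam1:exam3)),    # highest grade within the row
    min_grade = min(c_across(exam1:exam3))     # lowest grade within the row
  ) %>%
  ungroup()                                    # drop the row-wise grouping

print(grades_by_student)
```

Without the rowwise step, a call such as max(exam1, exam2, exam3) inside mutate would return the single largest value across all three columns and repeat it on every row, which is why the row-wise grouping matters here.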
One thing to be aware of is that base R provides a function called apply that lets us perform the same kind of row-wise operation. If we search forums, we often see solutions referencing apply, primarily because that function doesn’t require the tidyverse. However, apply tends to make ...