Important Tidyverse Functions for Data Science

Learn some convenient data-manipulation functions used by data scientists in the tidyverse.

Manipulating data to prepare it for more complex analysis often makes up a large share of our data science code, so these lines need to be correct, readable, and easy to understand. Fortunately, the tidyverse offers several convenient functions that directly handle many data-manipulation tasks. As a result, we’ll often write several lines of code like the following:

OUT_Var <- VAR_IN %>%
  group_by(a column) %>%
  summarise(OUT_Agg = statistical function(another column))
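The template above can be made concrete with a small example. This is only a sketch: the toy tibble and its CourseID and Grade columns are made up for illustration, and mean stands in for the statistical function.

```r
# Load dplyr quietly; dplyr re-exports tibble() and %>%
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data: one row per student per course
VAR_IN <- tibble(
  CourseID = c("A", "A", "B", "B", "B"),
  Grade    = c(90, 80, 70, 80, 90)
)

# Group the rows by course, then collapse each group to its mean grade
OUT_Var <- VAR_IN %>%
  group_by(CourseID) %>%
  summarise(OUT_Agg = mean(Grade))

# One row per course: A has a mean of 85, B has a mean of 80
OUT_Var
```

Swapping mean for another summary function, such as median or sd, follows the same structure.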

Alongside summarise, which creates statistical summaries of our data, we’ll often have several lines of code that use mutate to add latent measurements to a dataset:

OUT_Var <- VAR_IN %>%
  rowwise() %>%
  mutate(OUT_Agg = statistical function(a set of columns))
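As a rough illustration of this second template, the sketch below computes a per-row average across several columns. The tibble and its Quiz1 to Quiz3 columns are hypothetical, and c_across (available in dplyr 1.0.0 and later) is one way to hand a set of columns to the statistical function.

```r
# Load dplyr quietly; dplyr re-exports tibble() and %>%
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data: one row per student, one column per quiz
VAR_IN <- tibble(
  StudentID = c(1, 2, 3),
  Quiz1     = c(80, 60, 90),
  Quiz2     = c(90, 70, 100),
  Quiz3     = c(100, 80, 50)
)

# Treat each row as its own group, average the quiz columns within
# that row, then drop the row-wise grouping again
OUT_Var <- VAR_IN %>%
  rowwise() %>%
  mutate(OUT_Agg = mean(c_across(Quiz1:Quiz3))) %>%
  ungroup()

# Same rows as the input, plus OUT_Agg with the values 90, 70, and 80
OUT_Var
```

The final ungroup() is optional, but it keeps later steps from silently running row by row.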

And those statements will vary in complexity, length, and number depending on the project’s needs. So, the question of how to do something in the tidyverse often boils down to finding the function that fits within the structures above. In this lesson, we’ll dive into some specific functions that are particularly useful for data science in the context of data manipulation.

The function list covered here includes many of the standard tools in data science code. Having these functions at hand will enable us to quickly address many of our needs.

Unique values and counts

One function we’ll see frequently is distinct. The distinct function takes in a set of data columns and returns their unique combinations. If given just one column of data, it’ll return the unique values in that column. If given multiple columns of data (for instance, a whole tibble), it’ll return the unique rows in the data set. When we name specific columns, only those columns are considered in determining uniqueness, and the .keep_all argument controls what is returned: .keep_all = TRUE keeps all columns from the input tibble in the output tibble, using the first row found for each unique combination, while the default, .keep_all = FALSE, returns only the named columns. For example:

#Load tidyverse libraries
library(ggplot2)
library(purrr)
library(tibble)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(stringr)
library(readr)
library(forcats)

#Read in the grade dataset in long format
VAR_LongCourseData <- read_csv("GradeData-ByCourse.csv",
                               show_col_types = FALSE)

#Extract the unique StudentIDs
VAR_UniqueStudentIDs <- VAR_LongCourseData %>%
  distinct(StudentID,
           .keep_all = FALSE)

#Extract the unique StudentIDs and keep the other data columns
#This keeps the first record for each unique StudentID
VAR_UniqueStudentIDAllColumns <- VAR_LongCourseData %>%
  distinct(StudentID,
           .keep_all = TRUE)

#Extract the unique combinations of StudentID and CourseID
VAR_UniqueStudentIDsAndCourseIDs <- VAR_LongCourseData %>%
  distinct(StudentID,
           CourseID,
           .keep_all = FALSE)

#Print the results
print("Unique student IDs")
VAR_UniqueStudentIDs
print("Unique student IDs, but keep all other columns")
VAR_UniqueStudentIDAllColumns
print("Unique combinations of Student ID and Course ID")
VAR_UniqueStudentIDsAndCourseIDs
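Since the contents of GradeData-ByCourse.csv aren't shown here, the same three distinct calls can also be tried on a small stand-in tibble. This is a sketch only: the VAR_Toy data and its Grade column are invented for illustration and are not taken from the lesson's dataset.

```r
# Load dplyr quietly; dplyr re-exports tibble() and %>%
suppressPackageStartupMessages(library(dplyr))

# Hypothetical stand-in data: student 2 appears twice in course A
VAR_Toy <- tibble(
  StudentID = c(1, 1, 2, 2, 2),
  CourseID  = c("A", "B", "A", "A", "C"),
  Grade     = c(90, 85, 70, 75, 60)
)

# Only the StudentID column, one row per student (2 rows)
VAR_ToyIDs <- VAR_Toy %>%
  distinct(StudentID)

# One row per student, all columns kept; the first record for each
# student is retained, so the Grade values are 90 and 70
VAR_ToyIDsAllColumns <- VAR_Toy %>%
  distinct(StudentID, .keep_all = TRUE)

# Unique StudentID/CourseID pairs; the repeated (2, "A") pair collapses,
# leaving 4 rows
VAR_ToyIDsAndCourses <- VAR_Toy %>%
  distinct(StudentID, CourseID)

VAR_ToyIDs
VAR_ToyIDsAllColumns
VAR_ToyIDsAndCourses
```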

In the GradeData-ByCourse.csv example above, we extract different sets of unique identifiers from the dataset:

  • Lines 16–18: Extract the unique StudentIDs from VAR_LongCourseData.

  • Lines 22–24: Extract the unique StudentIDs from VAR_LongCourseData but keep all columns in the incoming dataset (specified by .keep_all ...