...
Important Tidyverse Functions for Data Science
Learn some convenient data-manipulation functions used by data scientists in the tidyverse.
Manipulating data to prepare it for more complex analysis often makes up a large share of our data science code, so these lines need to be correct, readable, and easy to understand. Fortunately, the tidyverse offers several convenient functions that directly handle many data-manipulation tasks. As a result, we’re often going to have several lines of code like the following:
OUT_Var <- VAR_IN %>%
  group_by(a_column) %>%
  summarise(OUT_Agg = statistical_function(another_column))
We can create statistical summaries of our data using summarise, in addition to several lines of code where we use mutate to add derived measurements to a dataset:
OUT_Var <- VAR_IN %>%
  rowwise() %>%
  mutate(OUT_Agg = statistical_function(a_set_of_columns))
And those statements will vary in complexity, length, and number depending on the project’s needs. So, the question of how to do something in the tidyverse often boils down to finding the function that fits within the structures above. In this lesson, we’ll dive into some specific functions that are particularly useful for data science in the context of data manipulation.
The function list covered here includes many of the standard tools in data science code. Having these functions at hand will enable us to quickly address many of our needs.
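As a concrete illustration of the two templates shown earlier, here is a minimal sketch on a small tibble. The data, the column names Exam1 and Exam2, and the VAR_Grades name are all invented for illustration; only the group_by/summarise and rowwise/mutate structure comes from the lesson.

```r
# Load only the packages this sketch needs
suppressPackageStartupMessages(library(dplyr))
library(tibble)

# Hypothetical grade data, invented for illustration
VAR_Grades <- tribble(
  ~StudentID, ~CourseID, ~Exam1, ~Exam2,
  1,          "MATH",    80,     90,
  1,          "ENGL",    70,     60,
  2,          "MATH",    100,    80
)

# Pattern 1: group_by + summarise returns one summary row per group
VAR_MeanExam1ByStudent <- VAR_Grades %>%
  group_by(StudentID) %>%
  summarise(OUT_MeanExam1 = mean(Exam1))

# Pattern 2: rowwise + mutate adds a column computed across columns of each row
VAR_RowAverages <- VAR_Grades %>%
  rowwise() %>%
  mutate(OUT_ExamMean = mean(c(Exam1, Exam2))) %>%
  ungroup()  # drop the row-wise grouping once the new column exists

VAR_MeanExam1ByStudent
VAR_RowAverages
```

The first result has one row per StudentID, while the second keeps every input row and gains one new column. Calling ungroup() at the end is a design choice in this sketch so that later steps are not accidentally evaluated row by row.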
Unique values and counts
One function we’ll see frequently is distinct. The distinct function takes in a set of data columns and returns their unique combinations. If given just one column of data, it’ll return the unique values in that column. If given multiple columns of data (for instance, a whole tibble), it’ll return the unique rows in the dataset. When feeding distinct a tibble with multiple columns, we choose which columns determine uniqueness by naming them in the call, and we control whether the remaining columns are returned by setting .keep_all, where .keep_all = TRUE keeps all columns from the input tibble in the output tibble, using the first row found for each unique combination. For example:
#load tidyverse libraries
library(ggplot2)
library(purrr)
library(tibble)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(stringr)
library(readr)
library(forcats)

#Read in the grade dataset in long format
VAR_LongCourseData <- read_csv("GradeData-ByCourse.csv",
                               show_col_types = FALSE)

#Extract the Unique StudentIDs
VAR_UniqueStudentIDs <- VAR_LongCourseData %>%
  distinct(StudentID,
           .keep_all = FALSE)

#Extract the unique StudentIDs and keep the other data columns
#This keeps the first record for each unique StudentID
VAR_UniqueStudentIDAllColumns <- VAR_LongCourseData %>%
  distinct(StudentID,
           .keep_all = TRUE)

#Extract the unique combinations of StudentID and CourseID
VAR_UniqueStudentIDsAndCourseIDs <- VAR_LongCourseData %>%
  distinct(StudentID,
           CourseID,
           .keep_all = FALSE)

#Print the results
print("Unique student IDs")
VAR_UniqueStudentIDs
print("Unique student IDs, but keep all other columns")
VAR_UniqueStudentIDAllColumns
print("Unique combinations of Student ID and Course ID")
VAR_UniqueStudentIDsAndCourseIDs
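Because the code above depends on the GradeData-ByCourse.csv file, the same distinct calls can also be tried on a small inline tibble. The following is a minimal sketch: the rows, the Grade column, and the VAR_Toy... names are hypothetical, and only StudentID and CourseID are taken from the lesson's dataset.

```r
suppressPackageStartupMessages(library(dplyr))
library(tibble)

# Hypothetical long-format data: student 1 has two MATH records
VAR_ToyCourseData <- tribble(
  ~StudentID, ~CourseID, ~Grade,
  1,          "MATH",    90,
  1,          "MATH",    85,
  1,          "ENGL",    70,
  2,          "MATH",    60
)

# Unique StudentIDs only (.keep_all defaults to FALSE)
VAR_ToyUniqueIDs <- VAR_ToyCourseData %>%
  distinct(StudentID)

# Unique StudentIDs, keeping every column from the first row seen per ID
VAR_ToyFirstRows <- VAR_ToyCourseData %>%
  distinct(StudentID, .keep_all = TRUE)

# Unique StudentID/CourseID combinations
VAR_ToyPairs <- VAR_ToyCourseData %>%
  distinct(StudentID, CourseID)

# No columns named: uniqueness is judged on whole rows
VAR_ToyUniqueRows <- VAR_ToyCourseData %>%
  distinct()
```

In this toy data, the two MATH records for student 1 collapse into one StudentID/CourseID pair, but they remain separate under a bare distinct() because their Grade values differ.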
In the GradeData-ByCourse.csv example, we extract different sets of unique identifiers from the dataset:
- Lines 16–18: Extract the unique StudentIDs from VAR_LongCourseData.
- Lines 22–24: Extract the unique StudentIDs from VAR_LongCourseData but keep all columns in the incoming dataset (specified by .keep_all
...