Important Tidyverse Functions for Data Science

Learn some convenient data-manipulation functions used by data scientists in the tidyverse.

We'll cover the following...

Unique values and counts
Subsetting and shifting data
Cumulative aggregations and ranking
Dealing with missing values

Manipulating data to prepare it for more complex analysis often makes up a large share of our actual data science code. We therefore need to get these data-manipulation lines right and keep them easy to read and understand. Fortunately, the tidyverse offers several convenient functions that directly handle many data-manipulation tasks. As a result, we’ll often write several lines of code that follow this pattern:

OUT_Var <- VAR_IN %>%
  group_by(a column) %>%
  summarise(OUT_Agg = statistical function(another column))
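As a concrete instance of the summarise pattern above, here is a minimal sketch on a small made-up tibble. The data and the names VAR_Grades and OUT_MeanGrades are hypothetical and chosen only for illustration; they are not part of the lesson's grade dataset.

```r
library(dplyr)

# Hypothetical toy data: one row per student per course
VAR_Grades <- tibble(
  StudentID = c(1, 1, 2, 2, 3),
  CourseID  = c("A", "B", "A", "B", "A"),
  Grade     = c(90, 80, 70, 60, 100)
)

# Group by one column, then summarise another column with a statistical function
OUT_MeanGrades <- VAR_Grades %>%
  group_by(StudentID) %>%
  summarise(OUT_MeanGrade = mean(Grade))

# One row per StudentID, with that student's average grade
OUT_MeanGrades
```

Swapping mean for another summary function such as median, max, or n() keeps the same structure and changes only the statistic being computed.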

Statements like this use summarise to create statistical summaries of our data. Alongside them, we'll often have several lines of code that use mutate to add new computed measurements to a dataset:

OUT_Var <- VAR_IN %>%
  rowwise() %>%
  mutate(OUT_Agg = statistical function(a set of columns))
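The rowwise and mutate pattern above can be sketched the same way. In this hedged example, the tibble VAR_Scores and its exam columns are hypothetical, and c_across is one way (an assumption about style, not necessarily the course's choice) to pass a set of columns to a statistical function within each row.

```r
library(dplyr)

# Hypothetical toy data: one row per student, one column per exam
VAR_Scores <- tibble(
  StudentID = c(1, 2, 3),
  Exam1     = c(90, 70, 100),
  Exam2     = c(80, 60, 90),
  Exam3     = c(70, 80, 95)
)

# Compute each student's mean exam score across a set of columns, row by row
OUT_Scores <- VAR_Scores %>%
  rowwise() %>%
  mutate(OUT_MeanExam = mean(c_across(Exam1:Exam3))) %>%
  ungroup()  # drop the row-wise grouping so later steps behave normally

OUT_Scores
```

Unlike summarise, mutate keeps every input row and column and simply appends the new column.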

These statements will vary in complexity, length, and number depending on the project’s needs. So, the question of how to do something in the tidyverse often boils down to finding the function that fits within the structures above. In this lesson, we’ll dive into some specific functions that are particularly useful for data manipulation in data science.

The function list covered here includes many of the standard tools in data science code. Having these functions at hand will enable us to quickly address many of our needs.

One function we’ll see frequently is distinct. The distinct function takes a set of columns and returns their unique combinations. If given just one column, it’ll return the unique values in that column. If given no columns at all, it considers every column and returns the unique rows of the tibble. When we name specific columns, only those columns determine uniqueness, and the .keep_all argument controls what comes back: with the default, .keep_all = FALSE, the output contains only the named columns, while .keep_all = TRUE keeps all columns from the input tibble, taking the first row found for each unique combination. For example:

#load tidyverse libraries
library(ggplot2)
library(purrr)
library(tibble)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(stringr)
library(readr)
library(forcats)
#Read in the grade dataset in long format
VAR_LongCourseData <- read_csv("GradeData-ByCourse.csv",
                            show_col_types = FALSE)
#Extract the Unique StudentIDs
VAR_UniqueStudentIDs <- VAR_LongCourseData %>%
                            distinct(StudentID,
                                .keep_all = FALSE)
 
#Extract the unique StudentIDs and keep the other data columns
#This keeps the first record for each unique StudentID
VAR_UniqueStudentIDAllColumns <- VAR_LongCourseData %>%
                                      distinct(StudentID,
                                            .keep_all = TRUE)
#Extract the unique combinations of StudentID and CourseID
VAR_UniqueStudentIDsAndCourseIDs <- VAR_LongCourseData %>%
                                      distinct(StudentID,
                                            CourseID,
                                            .keep_all = FALSE)
#Print the results
print("Unique student IDs")
VAR_UniqueStudentIDs
print("Unique student IDs, but keep all other columns")
VAR_UniqueStudentIDAllColumns
print("Unique combinations of Student ID and Course ID")
VAR_UniqueStudentIDsAndCourseIDs
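Because the example above depends on the GradeData-ByCourse.csv file, the following minimal sketch repeats the same distinct calls on a small inline tibble so the row counts are easy to check by hand. The tibble VAR_Toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with repeated StudentID and CourseID values
VAR_Toy <- tibble(
  StudentID = c(1, 1, 2, 2, 2),
  CourseID  = c("A", "B", "A", "A", "B"),
  Grade     = c(90, 80, 70, 75, 60)
)

# Unique StudentIDs only: 2 rows, 1 column
VAR_ToyIDs <- VAR_Toy %>% distinct(StudentID)

# Unique StudentIDs, keeping the first full row for each: 2 rows, 3 columns
VAR_ToyIDsAll <- VAR_Toy %>% distinct(StudentID, .keep_all = TRUE)

# Unique StudentID and CourseID pairs: 4 rows, 2 columns
VAR_ToyPairs <- VAR_Toy %>% distinct(StudentID, CourseID)

# No columns named: uniqueness is judged on every column, so all 5 rows remain
VAR_ToyRows <- VAR_Toy %>% distinct()

VAR_ToyIDs
VAR_ToyIDsAll
VAR_ToyPairs
VAR_ToyRows
```

Note that student 2 has two rows for course A with different grades. They collapse into one row when uniqueness is judged on StudentID and CourseID, but stay separate when every column is considered.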
