Filtering Datasets
Learn to analyze data subsets using filters in the tidyverse.
In data science work, we often need to filter or subset data. Frequently, we’ll want to analyze a subset of the data given to us based on some condition that we can check within the dataset itself. For example, in a student dataset, we might want to look at average grades only for students in a particular year or a specific course. To do that, we need to filter the data so that only the relevant records remain.
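As a rough illustration of the idea above, the sketch below filters a small, made-up grade table down to one course and then averages the remaining grades. The column names CourseID and Grade follow this lesson's grade file; the tibble name grades, the values in it, and the course HIST200 are hypothetical and exist only for this example.

```r
# Minimal sketch with made-up data; only the column names come from the lesson.
suppressPackageStartupMessages(library(dplyr))
library(tibble)

# Hypothetical grade records (values invented for illustration)
grades <- tibble(
  StudentID = c(1, 2, 3, 1, 2),
  CourseID  = c("MATH101", "MATH101", "MATH101", "HIST200", "HIST200"),
  Grade     = c(80, 90, 70, 85, 95)
)

# Keep only the MATH101 rows, then average the grades that remain
math101_average <- grades %>%
  filter(CourseID == "MATH101") %>%
  summarise(AverageGrade = mean(Grade))

print(math101_average)
```

Because the filter runs before the summary, the HIST200 grades never enter the average.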
Using filter
Filters in the tidyverse are applied in much the same way as group_by statements: we pipe a tibble into filter and supply a condition that each row must meet to be kept. In the example below, we use filter to subset student grade data contained in the attached CSV files. The file StudentInformation.csv contains general information regarding students, while the file GradeData-ByCourse.csv contains the students’ grades (Grade) for each course (CourseID).
# Load tidyverse libraries
library(ggplot2)
library(purrr)
library(tibble)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(stringr)
library(readr)
library(forcats)

# Load datasets directly to tibbles
VAR_StudentData <- read_csv("StudentInformation.csv",
                            col_names = TRUE,
                            skip = 0,
                            n_max = Inf,
                            show_col_types = FALSE)

VAR_GradeDataByCourse <- read_csv("GradeData-ByCourse.csv",
                                  col_names = TRUE,
                                  skip = 0,
                                  n_max = Inf,
                                  show_col_types = FALSE)

# Join the two data sets
VAR_CombinedStudentData <- VAR_StudentData %>%
  full_join(VAR_GradeDataByCourse,
            by = "StudentID", multiple = "all")

# Filter the combined data set to the MATH101 course
VAR_CombinedDataMath101 <- VAR_CombinedStudentData %>%
  filter(CourseID == "MATH101")

# Join and filter the data sets in a single command
VAR_CombinedDataMath101Piped <- VAR_StudentData %>%
  full_join(VAR_GradeDataByCourse,
            by = "StudentID", multiple = "all") %>%
  filter(CourseID == "MATH101")

# Output results
paste0("Multi-step combination of data")
VAR_CombinedDataMath101
paste0("Single step combination using pipe")
VAR_CombinedDataMath101Piped
- Lines 25–31: We add a single line here, the filter command. ...
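The lesson's code builds the MATH101 subset twice, once in separate steps and once in a single piped command. The sketch below repeats that comparison on small in-memory tibbles so it can be run without the CSV files, and checks that both routes give the same result. The tibbles students and grades, the Name column, and all values are hypothetical stand-ins for the lesson's files, and the join here relies on dplyr's default handling of multiple matches.

```r
# Minimal sketch with made-up data standing in for the lesson's CSV files.
suppressPackageStartupMessages(library(dplyr))
library(tibble)

# Hypothetical student information (Name is an invented column)
students <- tibble(
  StudentID = c(1, 2, 3),
  Name      = c("Ana", "Ben", "Chloe")
)

# Hypothetical grades, one row per student and course
grades <- tibble(
  StudentID = c(1, 1, 2, 3),
  CourseID  = c("MATH101", "HIST200", "MATH101", "HIST200"),
  Grade     = c(88, 92, 75, 81)
)

# Multi-step version: join first, then filter the stored result
combined <- students %>%
  full_join(grades, by = "StudentID")
math101_multi_step <- combined %>%
  filter(CourseID == "MATH101")

# Single-step version: join and filter in one pipeline
math101_piped <- students %>%
  full_join(grades, by = "StudentID") %>%
  filter(CourseID == "MATH101")

print(math101_piped)

# Both approaches should produce the same tibble
print(identical(math101_multi_step, math101_piped))
```

The piped form avoids keeping the intermediate joined table around, which is convenient when only the filtered subset is needed afterwards.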