Efficient Coding Practices
Learn best practices for creating efficient code in R, including parallelization and vectorization.
Writing efficient code is critical to improving the speed and scalability of data science work. With large datasets and complex analytical models, inefficient code can lead to long wait times and cumbersome processes. This lesson discusses strategies for writing efficient code in R, including vectorization, avoiding unnecessary looping, and optimizing data structures and functions.
Avoiding unnecessary looping
Looping over data can be a slow and inefficient process in R. In many cases, there are ways to avoid unnecessary looping and perform the same operation more quickly. One way to do this is to leverage built-in functions and packages designed to handle large datasets, such as the tidyverse packages (dplyr in particular) or data.table.
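As a minimal sketch of what a package-based, loop-free operation can look like, the snippet below computes a grouped mean with data.table. It assumes the data.table package is installed and uses the built-in mtcars dataset purely for illustration; the variable names VAR_CarsDT and OUT_MeanMpg are hypothetical and follow this lesson's naming style.

```r
# Sketch: grouped mean without an explicit loop, using data.table
# (assumes the data.table package is installed)
library(data.table)

# Convert the built-in mtcars data frame to a data.table (mtcars itself is unchanged)
VAR_CarsDT <- as.data.table(mtcars)

# DT[i, j, by]: compute the mean of mpg (j) for each value of cyl (by)
OUT_MeanMpg <- VAR_CarsDT[, .(MeanMpg = mean(mpg)), by = cyl]

# One row per cylinder count, in order of first appearance in the data
OUT_MeanMpg
```

The grouping and the calculation happen in a single call, so there is no need to subset the data and collect results by hand.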
Whenever we’re tempted to build a loop, our first instinct should be to seek out the functions that more efficiently handle the requirement. Doing so can have a dramatic impact on our code’s performance! It’s often tempting to construct a loop when we encounter a new requirement that involves manipulating a set of rows or columns, but in most cases, there’s already a tidyverse function that’ll handle our need efficiently; it’s just a matter of finding it.
```r
#load tidyverse libraries
library(ggplot2)
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(stringr)
library(readr)

#Bad example: Manipulating data with a for loop

#Create an empty list to store the results
VAR_IrisData <- as_tibble(iris)
OUT_ResultsFor <- list()

#Iterate over each species in the iris dataset
for (VAR_CurrSpecies in unique(VAR_IrisData$Species)) {

  #Subset the data for the current species
  VAR_SubsetData <- VAR_IrisData[VAR_IrisData$Species == VAR_CurrSpecies, ]

  #Calculate the mean of Sepal.Length for the current species
  VAR_MeanSepalLength <- mean(VAR_SubsetData$Sepal.Length)

  #Add the result to the list
  OUT_ResultsFor[[VAR_CurrSpecies]] <- VAR_MeanSepalLength
}

#Print the results for looping
paste0("Looping results")
OUT_ResultsFor

#Good example: Manipulating data with tidyverse functions
#Use group_by and summarize functions from dplyr
OUT_ResultsTidy <- VAR_IrisData %>%
  group_by(Species) %>%
  summarize(MeanSepalLength = mean(Sepal.Length))

#Print the results for tidyverse functions
paste0("Tidyverse results")
OUT_ResultsTidy
```
In this example, we present two solutions for calculating the mean Sepal.Length for each flower species in the iris dataset. The first, less ideal solution uses a for loop. The second, much better solution uses the tidyverse group_by and summarize functions.
- Line 14: Create a list object to store the results of the mean Sepal.Length calculation.
- Line 20: Filter the dataset to the current Species being looped on.
- Line 23: Calculate the mean Sepal.Length for the current Species.
- Line 26: Add the result to our list object, OUT_ResultsFor.
- Lines 35–37: Carry out the same calculation, mean Sepal.Length by Species, using tidyverse functions that remove the need for looping.
From this code example, notice that the two outputs are similar but not identical. The for loop returns a list where each element is the mean Sepal.Length for a particular flower species, while the tidyverse approach returns a tibble containing the same results. Also note that the printing of the tibble defaults to showing two decimal ...